re-identification: revisiting how we define personally identifiable information
Jeffrey Van Hulten, Re-identification: Revisiting how we define Personally Identifiable Information Cyberlaw--2013
Imagine it’s a rainy Sunday afternoon. You have just sat down in your living room,
laptop in hand, as you turn on your TV equipped with Netflix ready to resume viewing your
latest TV series obsession. Whilst watching the latest episode, you begin surfing the net and
decide to finally give in to the inner desire to make that online purchase for an item you have
desired for some time. After you place your order the website kindly asks you to rate several past
purchases you have made. You quickly do so and resume watching your show. Shortly after the
show ends, Netflix asks you to rate the episode. Again, you quickly provide a rating score and
carry on with the rest of your afternoon.
The scenario above may sound relaxing and familiar, but you may have just left enough
information for someone with the right tools to identify who you are, and thereby to
connect you to some of your most sensitive and personal information. Although
this may sound extreme, re-identification from anonymized data has become an
increasingly pervasive reality as mathematics and computer science continue to
advance.1 These advancements have made it possible for hackers, criminals,
researchers, and the government alike to take seemingly anonymous information and, with the
help of outside sources, begin to unlock the doors that lead to your most personal information,
including your actual identity.2
The concept of data anonymization is relatively simple and has been in existence since
the advent of digitizing information.3 Although approached in many ways, it consists of
removing all information that could be seen as personally identifiable: that is, information
that could be linked to sensitive information, such as health conditions, or that is directly
linked to your identity, such as your Social Security number or credit card number. The idea
then follows that once this information has been removed or redacted, all that remains is
harmless information such as age, gender, product ratings, etc.4 This information is then
analyzed in some fashion and/or shared publicly, with third parties, or internally to enhance
professional practices, support academic research, highlight demographic and consumer behavior
patterns, or simply provide public disclosure of information.5 The end result seems to be a
well-balanced approach that protects the privacy of the individuals whose information is being
shared while preserving the utility of the information.
1 See Latanya Sweeney, Computational Disclosure Control: A Primer on Data Privacy Protection, Massachusetts Institute of Technology (2001), available at http://dspace.mit.edu/handle/1721.1/8589
2 Arvind Narayanan and Vitaly Shmatikov, Privacy and Security: Myths and Fallacies of “Personally Identifiable Information,” 53 Communications of the ACM (June 2010), available at http://www.cs.utexas.edu/users/shmat/shmat_cacm10.pdf
3 Arvind Narayanan & Vitaly Shmatikov, Robust De-Anonymization of Large Sparse Datasets, in PROC. OF THE 2008 IEEE SYMP. ON SECURITY AND PRIVACY 111, 121 available at http://www.cs.utexas.edu/~shmat/shmat_oak08netflix.pdf [hereinafter Netflix Release]
However, while seemingly ideal, this balance operates on an ideological principle that no
longer applies in today’s technological age: personally identifiable information is only
information that is, or is directly linked to, a person’s identity or sensitive information. The
advances in mathematics and computer science have created a way to connect seemingly
harmless and unrelated information, like movie ratings or viewing patterns of a particular
unidentified person, to harmful information, like a specific person’s diagnosis of HIV or mental
illness.6 The privacy implications of such revelations are damning for the current
data-anonymizing approach to privacy protection.
4 See Latanya Sweeney, Achieving k-Anonymity Privacy Protection Using Generalization and Suppression, 10 INT’L J. ON UNCERTAINTY, FUZZINESS & KNOWLEDGE-BASED SYS. 571 (2002) available at http://dataprivacylab.org/dataprivacy/projects/kanonymity/kanonymity2.pdf
5 Ali Inan, Murat Kantarcioglu, Elisa Bertino, “Using Anonymized Data for Classification” University of Texas at Dallas, available at http://www.utdallas.edu/~muratk/publications/inan-AnonClassification.pdf
6 Bambauer, Jane, Tragedy of the Data Commons (March 18, 2011). Harvard Journal of Law and Technology, Vol. 25, 2011. Available at SSRN: http://ssrn.com/abstract=1789749 or http://dx.doi.org/10.2139/ssrn.1789749
These new advancements carry with them serious privacy right implications including the
need to reconsider the traditional regulatory approach of categorically defining what constitutes
personally identifiable information (PII). This approach examines PII on a spectrum with
personally sensitive information on one end and a person’s identity on the other.7 The current
approach to defining and regulating PII is limited to either end of this spectrum while leaving all
other information that falls between these two ends unregulated and out of the definitional scope
of PII.8
This emerging gap within the definition of PII is the focus of this paper and will be
approached in the following manner. First, Part I of this paper will examine the basic history and
approach to data anonymizing and its impact on defining PII in a categorical manner. Part II will
show how advancements in technology have undermined the current definitional approach
by examining a case study that demonstrates the definition’s pitfalls. Part III will lay out a
recent recommendation that offers a potential solution to this issue. And finally, Part IV
will examine the potential shortcomings of this recommendation by recommending additional
considerations in redefining PII and its subsequent regulatory scheme.
I. Data Anonymizing and the Categorical Approach to PII
In the current regulatory scheme regarding PII, data anonymizing has been hailed as the
cure-all for protecting privacy while ensuring the utility of information.9 Under the current legal
framework, privacy is to be protected in that the identity and/or sensitive information of an
individual is hidden or removed while other information regarding that person is analyzed and
shared with little or no regard to privacy protection.10 Before looking more deeply into how PII
gained its traditional definition through this approach, it is useful to first understand when
and for what purposes data anonymizing is used.
7 This is similar to the hallway analogy used by Paul Ohm. See Paul Ohm, Broken Promises of Privacy: Responding to the Surprising Failure of Anonymization, 57 UCLA L. Rev. 1701, 1749-50, 1759-60 (2010), available at http://papers.ssrn.com/sol3/papers.cfm?abstract_id=1450006 [hereinafter Broken Promises]
8 Id.
9 Douglas J. Sylvester & Sharon Lohr, The Security of Our Secrets: A History of Privacy and Confidentiality in Law and Statistical Practice, 83 DENV. U. L. REV. 147, 195 (2005), available at http://www.law.du.edu/images/uploads/denver-university-law-review/v83_i1_sylvesterLohr.pdf [hereinafter Security of Our Secrets]
Traditionally, information is shared for research purposes, such as health findings and
statistics shared among medical academics, or consumer practices and behaviors shared
between corporations or industries.11 Moreover, potentially personal information is often used
internally to show how particular practices or techniques affect an overall goal.12 For
example, a multi-level-marketing company may want to know whether age or gender influences the
sales of a particular product, or whether the strategies used by some employees push sales
higher or lower.
Furthermore, in other cases, and more so recently, information is released for
“crowdsourcing” purposes.13 As will be highlighted infra Part II, some corporations, like Netflix,
or even entire industries will release information to the general public with the idea that
volunteer analysts (e.g., bloggers, freelance researchers, etc.) will analyze the information
for free or for a prize, sparing the corporation the cost of hiring a small group of analysts.
Again, these public releases rely heavily on data anonymizing to protect the PII (sensitive and/or
identity-related information) while providing usable information in terms of an overall
purpose.14
10 Id.
11 Broken Promises, at 1708; see also Posting of Susan Wojcicki, Vice President, Product Management to The Official Google Blog, Making Ads More Interesting, http://googleblog.blogspot.com/2009/03/making-ads-more-interesting.html
12 Id.; see also Posting of Philip Lenssen to Google Blogoscoped, Google-Internal Data Restrictions, http://blogoscoped.com/archive/2007-06-27-n27.html
13 Jeff Howe, The Rise of Crowdsourcing, Wired Magazine Issue 14.06 (June 2006), available at http://sistemas-humano-computacionais.wdfiles.com/local--files/capitulo%3Aredes-sociais/Howe_The_Rise_of_Crowdsourcing.pdf
These uses of data anonymizing have largely influenced the way PII is viewed and
defined. More specifically, PII has been approached in a regulatory sense strictly from a
categorical perspective.15 This is largely because the practice of data anonymizing is centered on
categories of information that are seen as identifying a person and/or relating to highly sensitive
information about a person. This categorical approach has been mainly influenced by what legal
scholar Paul Ohm highlights as three main types of data anonymizing: suppression,
generalization, and aggregation.16
Data anonymizing can be done in many forms and tracked by the information supplier in
varying degrees. As Paul Ohm points out, the vast majority of information is anonymized and
then forgotten, in a privacy sense. Thus, the main types of data anonymizing all stem from the
same “release-and-forget” approach.17 The “forget” element of this approach refers to the
fact that what happens to the information once it is released (i.e., who it is shared with,
how it is treated, where it is shared, etc.) is never tracked.
A deeper look at these three approaches to data anonymizing makes clear how
PII has been defined only by categories of information. For example, suppression is the approach
that perhaps most people would assume anonymizing takes. In suppression the PII is simply
redacted or removed, leaving all other information intact.18 Thus, a company examining the
consumer habits of its online shoppers would redact consumers’ names, street/mailing addresses,
and credit card numbers while leaving the item purchased, any product rating made, how many
items were purchased, when the purchase was made, what products were viewed, the gender and age
of the consumer, and even the zip code to track habits by geographic region. This approach is
premised on the idea that identifying information is singled out and removed.19 The suppression
method does have inherent concerns because, while identifying information is removed, many
other pieces of information remain that could lead to the re-identification of a person.20 Thus, in
some cases suppression is not an adequate approach.
14 National Institutes of Health, HIPAA Privacy Rules for Researchers, http://privacyruleandresearch.nih.gov/faq.asp
15 Security of Our Secrets, at 182.
16 Broken Promises, at 1711-16
17 Id.
18 Id.
To combat this, another approach, generalization, can aid in providing more anonymity
while arguably protecting the utility of the information.21 In the consumer example
above, the zip code, gender, and age of a consumer could very easily narrow down the specific or
likely shoppers within a particular geographic region. Thus, the online store may want to
generalize the information, for instance by releasing an age range rather than a specific age for
each purchase or by combining several zip codes to broaden the geographic scope. This approach lowers the risk of
being re-identified but also lowers the quality and specificity of the findings that can be made
from the information.22
Finally, a third approach can be used to help circumvent either pitfall of suppression or
generalization but also comes at a price. Aggregation is the release of specific information that
typically has already been analyzed on some level.23 For example, if the online store wanted to
analyze the gender of purchasers of a particular product, it could supply the
analysts with two numbers: the number of men and the number of women who purchased the product. This
approach shields the raw data from others, creating more privacy protection while still providing
specific information. However, the information may be too specific, and thus this approach may
only work when examining a specific criterion rather than taking a holistic approach to
information analysis.
19 Broken Promises, at 1713; see also Claudio Bettini et al., The Role of Quasi-Identifiers in k-Anonymity Revisited (DICo Univ. Milan Tech. Rep. RT-11-06, July 2006), available at http://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=1&ved=0CDQQFjAA&url=http%3A%2F%2Fciteseerx.ist.psu.edu%2Fviewdoc%2Fdownload%3Fdoi%3D10.1.1.96.564%26rep%3Drep1%26type%3Dpdf&ei=YnpxUeTYMqerigLQz4GIDQ&usg=AFQjCNEVXlUOVxygHpaZU3TUtxH7fhYIew&sig2=jc0UrSI738JNwemTi7V7zQ&bvm=bv.45373924,d.cGE
20 Id. at 1714; see also Sweeney, supra note 4
21 Id.
22 Id.
23 Id. at 1715
The unifying theme through all of these approaches is the fact that data is anonymized by
virtue of categories. In all three approaches the information is assessed for what types of
information are too revealing in terms of identity and/or sensitivity. In no way do these
approaches account for the interplay of information or the interaction of openly available
information and how it may re-identify a person.24 Thus, PII is, and always has been,
viewed in a categorical sense, which overlooks the way seemingly harmless information can
provide connections to the information that stands protected on either end
of the privacy spectrum, namely identity and sensitive information. As Part II will further explain,
this definitional gap in what counts as PII is problematic under a legal scheme
that has yet to anticipate new harms and an appropriate way to address them.
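The three techniques described in this Part can be illustrated with a short sketch. The Python below is a hypothetical example, not drawn from any cited source: the records, field names, ten-year age bands, and three-digit zip prefixes are all assumptions chosen for illustration. It also counts how many records share the same zip code, age, and gender, as a rough indication of how easily a record could be singled out.

```python
from collections import Counter

# Hypothetical purchase records; every name, field, and value is invented.
records = [
    {"name": "A. Jones", "card": "4111-0000-0000-0001", "zip": "98101",
     "age": 34, "gender": "F", "item": "kettle", "rating": 5},
    {"name": "B. Smith", "card": "4111-0000-0000-0002", "zip": "98104",
     "age": 37, "gender": "F", "item": "kettle", "rating": 4},
    {"name": "C. Brown", "card": "4111-0000-0000-0003", "zip": "98109",
     "age": 52, "gender": "M", "item": "kettle", "rating": 2},
    {"name": "D. Green", "card": "4111-0000-0000-0004", "zip": "98122",
     "age": 58, "gender": "M", "item": "lamp", "rating": 3},
]

# Assumption: only these two fields are treated as direct identifiers.
DIRECT_IDENTIFIERS = {"name", "card"}


def suppress(record):
    """Suppression: drop the fields treated as direct identifiers."""
    return {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}


def generalize(record):
    """Generalization: coarsen age to a ten-year band and zip to a prefix."""
    out = dict(record)
    low = (record["age"] // 10) * 10
    out["age"] = f"{low}-{low + 9}"
    out["zip"] = record["zip"][:3] + "**"
    return out


def aggregate(rows, item):
    """Aggregation: release only purchaser counts by gender for one item."""
    return dict(Counter(r["gender"] for r in rows if r["item"] == item))


def smallest_group(rows):
    """Size of the smallest group sharing one (zip, age, gender) combination."""
    groups = Counter((r["zip"], r["age"], r["gender"]) for r in rows)
    return min(groups.values())


suppressed = [suppress(r) for r in records]
generalized = [generalize(r) for r in suppressed]

print(smallest_group(suppressed))       # group size before generalization
print(smallest_group(generalized))      # group size after generalization
print(aggregate(suppressed, "kettle"))  # counts only, no row-level data
```

On this invented data, every suppressed record is still unique on zip code, age, and gender, while generalization leaves at least two records in each group, at the cost of less specific findings.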
II. PII and the Definitional Gap: Netflix, a Case Study
The definitional gap of PII can be difficult to understand in the abstract because
ratings of online purchases or viewed movies seem rather disconnected from your Social
Security number or medical history. This is typically compounded by the fact that knowing a
person’s shopping or movie preferences seems rather harmless in comparison to knowing their Social
Security number.25 Thus, it is useful to highlight a case study that demonstrates the slippery
slope on which technology has placed privacy rights, and how ill-equipped our regulatory scheme
currently is to handle such potential harm in light of its categorical approach to defining PII.26
24 Id. at 1716
25 Nancy Messieh, Do We Really Care About Our Online Privacy?, The Next Web, September 13, 2011, available at http://thenextweb.com/insider/2011/09/13/do-we-really-care-about-our-online-privacy/
In October of 2006, Netflix announced a contest with a winning prize of $1 million.27 The
prize would go to the best new mathematical algorithm designed to help Netflix’s computer
servers better predict what movies to recommend to customers based upon the customers’
viewing histories.28 In order to facilitate this contest, Netflix released over one hundred million
records containing the personal viewing history of nearly five hundred thousand customers.29
The sharing of the information was largely motivated by potential financial gains; more
specifically, if Netflix could ascertain a better way to predict movies that its customers would
actually enjoy, it stood to fortify its repeat customer rates, resulting in a larger profit
margin.30 In order to protect the privacy of the customers, Netflix used suppression tactics:
it replaced all customer names with non-identifiable ID numbers and redacted all other
traditionally defined PII, namely credit card numbers, addresses, etc.31
The public release of information was seen as a great opportunity for researchers in the
mathematics and computer science fields, not just for potentially winning the prize, but for the
free access to a large dataset that could be used to test theories and the latest
advances in technology.32 The research that resulted from the Netflix release would prove
more startling, however, as the implications of these advancements would force many privacy
advocates to reexamine what actually constitutes PII and whether a categorical approach is the
best policy approach for a regulatory scheme.33
26 Broken Promises, at 1731-45
27 Id. at 1720-22; see also The Netflix Prize Rules, http://www.netflixprize.com/rules
28 Id.; see also Netflix Prize: FAQ, http://www.netflixprize.com/faq
29 Id.; see also supra note 27
30 Id.
31 Id.
32 Id.
Among the research that was produced, a study showing the ability to link
anonymized data to the identity of the actual viewer was among the more surprising findings.34
The study began by showing that if a Netflix customer was included within the data
released for the contest, a person knowing very little about that customer would be able
to identify which viewing history record belonged to that customer and
ultimately connect an identity with a specific set of anonymized data.35 Of course this initially
does not seem like the death knell of privacy via anonymized data in that a customer’s medical
history was not subsequently exposed. However, the findings sparked more research that showed
a deeper undercurrent to what potentially could be connected regarding a person’s identity or
sensitive information once this connection was made.36
For example, researchers took the Netflix data and used mathematical algorithms to
cross-reference the viewing histories released by Netflix with the rating histories of individual
users on IMDb.com. IMDb.com holds all of its information publicly, and its users, who appear
under opaque ID names or numbers, can rate any movie within the database. The interesting
element of this research was the fact that Netflix and IMDb.com did not contain identical
lists of movies. Therefore, each contained information about a viewer’s
movie preferences that the other did not. The research showed that, from a
small subset of IMDb.com users, several were statistical matches for a customer’s viewing
history found in the Netflix release.37 What was more telling was the complete picture that the
two databases, when combined, painted of the political persuasion and social viewpoints
this viewer held, based upon the ratings of several movies and TV shows.
33 Arvind Narayanan & Vitaly Shmatikov, How to Break the Anonymity of the Netflix Prize Dataset, ARXIV, Feb. 5, 2008, at 1, http://arxiv.org/pdf/cs/0610105v2.pdf
34 Id.
35 Id.; see also Justin Vastola et al., Statistics for Re-Identification in Network Models, University of Pennsylvania, available at https://opimweb.wharton.upenn.edu/linkservid/1BAB25B2-D765-78AD-322BC102A698C73A/showMeta/0/
36 Id.
Again, this has yet to connect viewing history with medical history or other sensitive
information but it is the very activity that could. As we reconsider the spectrum of privacy the
connections made between all the pieces of information that fall in the middle of the two ends of
the spectrum lead to an eventual connection between the two ends themselves.38
For example, a person’s viewing history may be linked to their social media account
profiles that show the same sets of movies being liked or followed. Using
mathematics and computer science, a person can then estimate how
many people would have liked the same set of movies and, through a
process of elimination, connect that social media account to a matching
viewing history, thereby re-identifying someone via seemingly anonymized data.39
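The linking process described above can be sketched in code. The Python below is a deliberately simplified illustration and is not the algorithm used in the study cited in note 33: the titles, ID numbers, ratings, scoring rule, and thresholds are all hypothetical. It scores each "anonymized" record by how many of a person's publicly known ratings it approximately matches, and reports a match only when one record clearly stands apart from the rest.

```python
# Hypothetical "anonymized" release: opaque ID -> {title: rating}.
anonymized = {
    "user_1041": {"Film A": 5, "Film B": 1, "Film C": 4, "Film D": 2},
    "user_2290": {"Film A": 3, "Film B": 3, "Film E": 5},
    "user_3377": {"Film C": 2, "Film D": 5, "Film F": 4},
}

# Ratings a named person posted publicly elsewhere (also invented).
public_profile = {"Film A": 5, "Film C": 4, "Film D": 2, "Film G": 3}


def similarity(known, candidate, tolerance=1):
    """Count titles rated in both sets whose ratings differ by <= tolerance."""
    return sum(
        1
        for title, rating in known.items()
        if title in candidate and abs(candidate[title] - rating) <= tolerance
    )


def best_match(known, dataset, min_gap=2):
    """Return the top-scoring ID only if it beats the runner-up by min_gap.

    Assumes the dataset holds at least two records. Requiring a gap is a
    crude stand-in for the statistical test a real study would apply.
    """
    scores = sorted(
        ((similarity(known, ratings), uid) for uid, ratings in dataset.items()),
        reverse=True,
    )
    (top, uid), (second, _) = scores[0], scores[1]
    return uid if top - second >= min_gap else None


print(best_match(public_profile, anonymized))
print(best_match({"Film B": 3}, anonymized))
```

In this toy data, three approximately matching ratings are enough to isolate one record, while a single known rating is not. That is the process of elimination at work: each additional known preference shrinks the set of candidates.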
Subsequently, once an identity is found and connected to a particular social media
account the person creating the connections has just gained access to a new cache of information
about an individual that can be cross referenced with other databases and more connections can
potentially be made regarding a person’s identity and sensitive information.40 This process
continues to repeat until a solid connection between actual identity and sensitive information is
formed. The advances in technology have not only made this process possible but have also
made seemingly random anonymous data identifiable and connectable at increasingly faster
speeds.41
37 Id.
38 Broken Promises, supra note 7
39 Narayanan, supra note 33
40 Broken Promises, supra note 7
The overarching implication of these findings is detrimental to the categorical approach
to PII. In particular, PII is defined by classifying categories of information as personal or
non-personal. Thus, the regulatory scheme enforces privacy protection by requiring that information
classified as personal be protected and safeguarded while non-personal information is not.42
However, as the Netflix release demonstrates, all information has the potential to be
personal information. Thus, maintaining a categorical approach would require a ban on any
release or sharing of information, which would run afoul of several laws, including the Constitution.
To combat this issue some states’ courts have begun to expand the definition of PII to
include other pieces of information traditionally not viewed as PII.43 However, this still
maintains that there are some pieces of information that could not connect a person’s identity to
or with their sensitive information. As technology continues to advance, this will
become less true, leaving gaps in how the current regulatory scheme deals with such potential
issues. Part III will examine a specific recommendation for moving beyond the current approach of
categorically defining PII.
III. Paul Ohm and the Five Factor Test
This definitional gap in PII that has been created through connecting seemingly unrelated
pieces of information and tying them back to the identity and/or sensitive information of a person
is problematic for two reasons. First, it forces PII’s definition to be expanded to virtually all
pieces of information, leaving room for sweeping regulation that could greatly harm the
dissemination of information. Second, this issue is taking place under a regulatory scheme that
focuses solely on the information itself.44 It does not consider the collector/sharer or the receiver
of the information and what they are or are not doing to and with the information they obtain.
41 Arvind Narayanan et al., Link Prediction by De-Anonymization: How We Won the Kaggle Network Challenge, arXiv:1102.4374v1, Feb. 22, 2011, available at http://arxiv.org/pdf/1102.4374v1.pdf
42 Broken Promises, supra note 7, at 1731-44
43 Robert E. Braun and Craig A. Levine, Client Alert: California Supreme Court Rules That Zip Codes Are Personally Identifiable Information, JMBM (Feb. 16, 2011), available at http://www.jmbm.com/docs/california_supreme_court_rules_that_zip_codes_are_personal_identification_information.pdf
In reconsidering the definitional gap of PII and these two subsequent issues, Paul Ohm
has created an approach he suggests would fill this gap and create a better “post-anonymization”
regulatory scheme that rethinks the traditional PII approach.45 More specifically, Paul Ohm
argues that a system with a comprehensive privacy baseline, together with a more contextualized,
sector-specific regulatory arm that provides additional protection above the baseline, will
better fit the privacy reality of today and the future.46 This two-tiered approach is further
supplemented by a Five Factor Test that regulators are to apply when considering the second tier of
sector-specific regulation. These factors are: 1) Data Handling Techniques, 2) Private versus
Public Release, 3) Quantity, 4) Motive, and 5) Trust.47
Before examining the details of the Five Factor Test, it is important to first briefly
discuss how this approach changes the current categorical approach to defining PII. This new
approach does not require that categories become a thing of the past. Although unrelated
information may now have the potential to be connected, thus swallowing all categories of
information into the operative definition of PII, these categories can still serve as a starting point
that helps craft a more holistic view of the situation. The idea is essentially that who has the
information, what they are doing with it, and how it is being handled may be approached
categorically based on how serious the information is viewed to be in terms of privacy.48 It is from
this platform that the Five Factor Test launches the major considerations, including categories,
that regulators should weigh when rethinking PII and privacy law in general.
44 Broken Promises, supra note 7, at 1731
45 Id. at 1751
46 Id. at 1762-64
47 Id. at 1764-68
The Five Factor Test begins with Data Handling Techniques. More specifically, Paul Ohm
suggests that regulators should consider how categories of information are treated in terms of
privacy and how susceptible each technique is to fostering re-identification.49 Thus, the
resulting suggestion is to create a rubric that rates handling techniques on some type of scale,
which would then allow regulators to set varying standards on that scale for
particular industries and sectors.50
The second and third factors are Private versus Public Releases and Quantity. Private
versus Public Release is just as it sounds, and is directed at the very practice utilized by Netflix
in their 2006 public release of information.51 The idea behind this factor is that public releases
serve very little if any true utilitarian purpose. The driving force is often monetary. Therefore,
regulators should only consider allowing for Public Releases of anonymized data for exceptional
purposes.52 Although not ideal from an information utility standpoint it is privacy at the cost of
some utility.
The third factor, Quantity, takes a similar approach of making practices that are common
today more exceptional in the future. Here, Paul Ohm argues for regulation to take
into account the amount of information one database is allowed to hold, noting that one-stop
shopping for information greatly heightens the likelihood of re-identification.53 Moreover,
the longer information is retained, the greater this risk becomes. Thus, Paul Ohm suggests that
regulators consider regulating both the amount of information and the duration for which it is kept.54
48 Id. at 1759
49 Id. at 1764-68
50 Id.
51 Id.
52 Id.
53 Id.
Finally, the fourth and fifth factors are Motive and Trust. Both factors lend themselves to
more of a philosophical ideal that the traditional PII definition and regulatory scheme have not
possessed.55 In particular, Motive suggests that the reason for sharing should be a consideration
in how relaxed or strict the regulation of a particular category of information should be. This
allows flexibility to regulate purer pursuits, such as academic research, less harshly than
monetary motives. Trust, much like Motive, considers the adversarial potential of any
given actor; regulation should therefore vary based upon the level of trust we have
in the particular people and institutions that have or will receive information.56
The two-tiered approach and the additional Five Factor Test serve as a more comprehensive
alternative to the under- or over-inclusive potential of the traditional PII definition and
regulatory scheme. However, as Part IV will highlight, this approach comes with its own
shortcomings and needs additional tweaks in order to more fully fill the definitional
gap while also preserving the utility of information.
IV. The Five Factor Test: Analysis and Considerations
The two-tiered approach with its supplementary Five Factor Test is premised on a
fundamental principle that many can agree upon: the traditional definition of PII is rooted
in the idea that information itself can be categorized as either potentially harming one’s privacy or
being irrelevant to it.57 As the Netflix release of 2006 demonstrates, this approach is insufficient,
and thus the way PII has been defined is ineffective. The new system suggested by Paul Ohm
correctly shifts the definition of PII away from information and toward how information is
treated and by whom it is collected and shared.58 Thus, the definitional gap of PII is filled by
allowing any information to be PII if it is treated in a certain way or shared with those who
may use that information to invade one’s privacy, not simply by what kind of information it is
generally. However, this definitional shift requires a new approach to the regulatory scheme that
relies upon the traditional PII definition, and while Paul Ohm has crafted a thoughtful approach, it
has several shortcomings that need to be considered and remedied before moving forward in
recalibrating the regulatory scheme as it currently stands.
54 Id.
55 Id.
56 Id.
57 Id. at 1742-43
First, the two-tiered system with the supplemental Five Factor Test relies heavily upon
industry regulation.59 Moreover, this new system is reliant upon regulation that is tailored to
specific industries in different ways. This is problematic for two reasons. One, this type of
approach is susceptible to industry discrimination. Although this is not a likely legal hurdle, it
will remain a political one. This new system will likely lead many industry-hired lobbyists
to seek new legislation easing their industry-specific regulations for any number of
reasons. This is not a new political reality. It makes the new approach susceptible to
efficiency arguments as well as free market criticisms.
Two, in the same vein of political reality, it bears noting that theoretical approaches to
reform are never best served when examined in a vacuum. Considering the influence
of special interests in U.S. politics, this approach may produce a strange result when attempted in
such a political atmosphere. There is the potential that industries could lobby to receive lower
standards than they should be given under the theoretical approach, thereby circumventing the
58 Id. at 1764-68 59 Id.
overall purpose of the reform. Again, this is highly circumstantial, but it could be the result when
the categorical approach shifts from defining information types to defining industry types.60
To combat these two potential issues, we must step back to the theoretical approach
proposed and redesign the definitional shift of PII. Although a categorical approach to
information has its gaps, it should not be abandoned in the sense that industry categorization
replaces it as the main definitional determinant. Instead, the approach should factor in
information’s re-identification potential under a variety of circumstances to shape how PII is
redefined. Thus, rather than targeting industries, the approach should target industry practices. This would leave
industries feeling less “singled out” and foster more “best practices” conversations. It could also,
arguably, thwart some special interests by appearing to the public less industry focused and more
procedure focused.
Second, the two-tiered/five factor approach has left a key player out of consideration: the
individual. As noted above, this approach is highly industry centric. Indeed, it is true that the majority
of privacy harms stem from industries and their treatment of information in a way that could
make it become PII. However, the same can be said of the individual whose information is being
collected. Throughout this paper, subtle references have been made to the fact that outside
information is needed to aid those who are attempting to re-identify anonymized information.
This outside information often comes from an individual’s social networking accounts
and other public-oriented venues where vast amounts of information unique to that individual are
put on display for some or all to see. The reformed approach currently suggested does not
account for this fact at all. It places all the responsibility upon industries and their practices. This
is largely problematic because information may be safely guarded against shifting from non-PII to
60 Id.
PII through all sorts of regulations and mandatory practices, yet that protection could be undone by the simple
actions of an individual who may unwittingly cause their own information to become PII.61
In order to combat this, two things should be added to the suggested reforms. First, a
mechanism must be put into place whereby industries are not held liable or responsible in
instances where seemingly harmless information becomes PII due to the actions of an individual.
For example, in the Netflix Release, if an individual places all of their Netflix ratings on their
Facebook page with no privacy settings activated, leaving the page open to the public, no
industry standard or practice will be able to protect this individual without banning any release or
sharing altogether. This runs into the original definitional problem of PII swallowing all
information within its definition.
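The mechanics of this example can be sketched in a few lines of Python. The sketch below is only an illustration under stated assumptions: the pseudonymous IDs, film titles, ratings, and the 0.8 matching threshold are all hypothetical, and the simple overlap score is a stand-in chosen for clarity, not the statistical method actually used against the Netflix Release. It shows how a handful of ratings that an individual posts publicly can single out one record in a release that contains no names at all.

```python
# Illustrative sketch only: every name, ID, title, and rating here is hypothetical.

def similarity(public_ratings, anonymous_ratings):
    """Fraction of publicly posted (title, score) pairs found in an anonymized record."""
    if not public_ratings:
        return 0.0
    matches = sum(
        1 for title, score in public_ratings.items()
        if anonymous_ratings.get(title) == score
    )
    return matches / len(public_ratings)


def link(public_ratings, released, threshold=0.8):
    """Return the pseudonymous ID that best matches the public ratings.

    Returns None when no record clears the threshold or the best match is tied.
    """
    scored = sorted(
        ((similarity(public_ratings, record), anon_id)
         for anon_id, record in released.items()),
        reverse=True,
    )
    if not scored or scored[0][0] < threshold:
        return None
    if len(scored) > 1 and scored[1][0] == scored[0][0]:
        return None  # ambiguous: more than one record fits equally well
    return scored[0][1]


# An "anonymized" release: pseudonymous IDs mapped to film ratings, no names.
released = {
    "user_0417": {"Film A": 5, "Film B": 2, "Film C": 4, "Film D": 1},
    "user_0982": {"Film A": 3, "Film B": 2, "Film E": 5},
    "user_1533": {"Film C": 4, "Film F": 2},
}

# Ratings the individual posted on a public profile under their real name.
public_profile = {"Film A": 5, "Film B": 2, "Film C": 4}

print(link(public_profile, released))
```

Nothing in the released records is an identifier in the traditional sense. The link is supplied entirely by what the individual chose to disclose, and once it is made, the remaining entries in the matched record (here, the rating for "Film D") attach to a named person.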
Thus, in this case, Netflix should not be held liable for the fact that the release caused the
viewing history of this individual to become PII. Of course, this assumes the release in this
example was done between two private parties or met the exceptional requirement under the
Private versus Public Release factor. This paper will not address who, the courts or the
legislature, is best suited to determine the levels of liability based upon privacy settings and individual
release of information. Suffice it to say, though, that individual responsibility must be factored
into the new regulatory scheme.
The second issue regarding individuals is a lack of knowledge. Currently, and
potentially under the new system, individuals are not made aware of how their information is
used and, subsequently, how their very own actions could undermine their own privacy. Thus, the
regulatory scheme must provide some regulation regarding how industry practices
interplay with the individual’s practices regarding their information. For example, Facebook
61 Id.
could have a concise and simple warning, one that must be clicked through before posting anything,
reminding individuals that information shared without appropriate privacy settings could lead
to re-identification of anonymized data and thus adversely affect the individual’s privacy. It is
important that the new regulatory system provide transparency and education to the individual
wherever possible.
Finally, the proposed system assumes, or at least fails to address, another unlikely political
reality: an educated legislature. Although this final issue is outside the scope of the PII
definitional issue, it still serves as an appropriate inquiry because whoever creates and institutes these
regulations will effectively shape what the new definition of PII will be. Accurately
determining which industries should be regulated, and at which levels, would require a fairly
in-depth, if not specialized, level of knowledge. This knowledge would need to include, at a minimum, an
understanding of current information collecting and trading practices, a general understanding of
the technology involved, and how potential advances could alter its capability. The regulators in
Paul Ohm’s approach would need to have an assumed level of understanding that is both
adequate and appropriate to make informed decisions on the specifics of the regulations created.62
No suggestions are made as to what level of knowledge is required to appropriately make these
regulatory decisions or how to ensure it exists.
To combat this shortcoming, perhaps an additional piece needs to be added to this two-tiered/five factor
approach: expand the duties of the FCC or create a new regulatory agency. In
either case, the regulators would not be the legislators themselves, who may or may not have
adequate knowledge to make these decisions, but an agency that can be staffed with those who
do. It bears noting again that this paper does not address whether or not the federal government
62 Id.
should be involved in this regulation process or whether the courts should be the ones to foster
the paradigm shift. This paper simply considers the question from a largely legislative approach. In the end, an
agency staffed with people knowledgeable about the many facets of this issue would
likely be a better regulator than Congress itself.
CONCLUSION
The concept of data anonymization has served as a beacon of security and reassurance as
we have emerged into a modern age filled with technological advances that can both enhance and
harm our privacy expectations. However, as these advances continue forward, they have eroded
the very security they once stood to protect. The ability to analyze information, cross-reference
it, and re-identify to whom it belongs has given reason for pause, as our regulatory framework has
not been designed to handle this new conundrum.
As we approach rethinking this system and its fundamental foundation in a categorical
definition of PII, it is important to maintain this foundation and build from it. As this paper has
shown, there are approaches that, with certain enhancements, can begin to form a system that fills
the definitional gap of PII by moving from informational categories to how information is used, based on the
interplay of industry and behavioral components; components that can make seemingly irrelevant
anonymized information re-identifiable and harmful to an individual’s privacy expectations.