
Re-Identification: Revisiting How We Define Personally Identifiable Information

Jeffrey Van Hulten, Cyberlaw, 2013

Imagine it's a rainy Sunday afternoon. You have just sat down in your living room, laptop in hand, as you turn on your TV equipped with Netflix, ready to resume viewing your latest TV series obsession. Whilst watching the latest episode, you begin surfing the net and decide to finally give in to the inner desire to make that online purchase for an item you have desired for some time. After you place your order, the website kindly asks you to rate several past purchases you have made. You quickly do so and resume watching your show. Shortly after the show ends, Netflix asks you to rate the episode. Again, you quickly provide a rating score and carry on with the rest of your afternoon.

In the above scenario, aside from sounding relatively relaxing and familiar, you may have just left enough information for someone with the right tools to identify who you are, potentially connecting you to some of your most sensitive and personal information. Although this may sound extreme, the reality of re-identification from anonymized data has become increasingly pervasive as technology in computer science and mathematics continues to advance.1 These advancements have made it possible for hackers, criminals, researchers, and the government alike to take random anonymous information and, with outside source information, begin to unlock the doors that lead to your most personal information, including your actual identity.2

The concept of data anonymization is relatively simple and has been in existence since the advent of digitized information.3 Although approached in many ways, the concept consists of removing all information that could be seen as personally identifiable; that is, information that could be linked to sensitive information, such as health conditions, or directly linked to your identity, such as your Social Security number or credit card number. The idea then follows that once this information has been removed or redacted, all that remains is harmless information such as age, gender, product ratings, etc.4 This information is then analyzed in some fashion and/or shared publicly, with third parties, or internally to enhance professional practices, further academic research, highlight demographics and consumer behavior patterns, or simply to provide public disclosure of information.5 The end result seems to be a well-balanced approach: the privacy of the individuals whose information is being shared is protected, while the utility of the information is preserved.

1 See Latanya Sweeney, Computational Disclosure Control: A Primer on Data Privacy Protection, Massachusetts Institute of Technology (2001), available at http://dspace.mit.edu/handle/1721.1/8589
2 Arvind Narayanan & Vitaly Shmatikov, Privacy and Security: Myths and Fallacies of "Personally Identifiable Information," 53 Communications of the ACM (June 2010), available at http://www.cs.utexas.edu/users/shmat/shmat_cacm10.pdf
3 Arvind Narayanan & Vitaly Shmatikov, Robust De-Anonymization of Large Sparse Datasets, in Proc. of the 2008 IEEE Symp. on Security and Privacy 111, 121 (2008), available at http://www.cs.utexas.edu/~shmat/shmat_oak08netflix.pdf [hereinafter Netflix Release]

However, while seemingly ideal, this balance operates on an ideological principle that no longer applies in today's technological age: that personally identifiable information is only information that is, or is directly linked to, a person's identity or sensitive information. The advances in mathematics and computer science have created a way to connect seemingly harmless and unrelated information, like the movie ratings or viewing patterns of a particular unidentified person, to harmful information, like a specific person's diagnosis of HIV or mental illness.6 The privacy implications of such revelations are damning to the current data-anonymizing approach to privacy protection.

4 See Latanya Sweeney, Achieving k-Anonymity Privacy Protection Using Generalization and Suppression, 10 Int'l J. on Uncertainty, Fuzziness & Knowledge-Based Sys. 571 (2002), available at http://dataprivacylab.org/dataprivacy/projects/kanonymity/kanonymity2.pdf
5 Ali Inan, Murat Kantarcioglu & Elisa Bertino, Using Anonymized Data for Classification, University of Texas at Dallas, available at http://www.utdallas.edu/~muratk/publications/inan-AnonClassification.pdf
6 Jane Bambauer, Tragedy of the Data Commons, 25 Harvard Journal of Law and Technology (2011), available at http://ssrn.com/abstract=1789749 or http://dx.doi.org/10.2139/ssrn.1789749


These new advancements carry with them serious privacy implications, including the need to reconsider the traditional regulatory approach of categorically defining what constitutes personally identifiable information (PII). This approach examines PII on a spectrum, with personally sensitive information on one end and a person's identity on the other.7 The current approach to defining and regulating PII is limited to either end of this spectrum, leaving all other information that falls between these two ends unregulated and outside the definitional scope of PII.8

This emerging gap within the definition of PII is the focus of this paper and will be approached in the following manner. First, Part I will examine the basic history of and approach to data anonymizing and its impact on defining PII in a categorical manner. Part II will highlight, through a case study that demonstrates the definition's pitfalls, how advancements in technology have destroyed the current definitional approach. Part III will lay out a recent recommendation that offers a potential solution to this issue. And finally, Part IV will examine the potential shortcomings of that recommendation and propose additional considerations for redefining PII and its subsequent regulatory scheme.

I. Data Anonymizing and the Categorical Approach to PII

In the current regulatory scheme regarding PII, data anonymizing has been hailed as the cure-all for protecting privacy while ensuring the utility of information.9 Under the current legal framework, privacy is protected in that the identity and/or sensitive information of an individual is hidden or removed, while other information regarding that person is analyzed and shared with little or no regard to privacy protection.10 Before looking more deeply into how PII has gained its traditional definition through this approach, it is useful to first understand when and for what purposes data anonymizing is utilized.

7 This is similar to the hallway analogy used by Paul Ohm. See Paul Ohm, Broken Promises of Privacy: Responding to the Surprising Failure of Anonymization, 57 UCLA L. Rev. 1701, 1749-50, 1759-60 (2010), available at http://papers.ssrn.com/sol3/papers.cfm?abstract_id=1450006 [hereinafter Broken Promises]
8 Id.
9 Douglas J. Sylvester & Sharon Lohr, The Security of Our Secrets: A History of Privacy and Confidentiality in Law and Statistical Practice, 83 Denv. U. L. Rev. 147, 195 (2005), available at http://www.law.du.edu/images/uploads/denver-university-law-review/v83_i1_sylvesterLohr.pdf [hereinafter Security of Our Secrets]

Traditionally, information sharing is motivated by research purposes, such as health findings and statistics being shared among medical academics, or consumer practices and behaviors being shared between corporations or industries.11 Moreover, potentially personal information is often used internally to help demonstrate how particular practices or techniques are impacting an overall goal.12 For example, a multi-level-marketing company may want to know whether age or gender influences the sales of a particular product, or whether the particular strategies utilized by some employees push sales higher or lower.

Furthermore, in other cases, and more so recently, information is released for "crowdsourcing" purposes.13 As will be highlighted infra Part II, some corporations, like Netflix, or even entire industries, will release information to the general public with the idea that volunteer analysts (e.g., bloggers, freelance researchers, etc.) will analyze the information for the corporation for free or for a prize, rather than the corporation hiring a small group of analysts. Again, these public releases rely heavily on data anonymizing to protect the PII (sensitive and/or identity-related information) while providing usable information in terms of an overall purpose.14

10 Id.
11 Broken Promises, at 1708; see also Posting of Susan Wojcicki, Vice President, Product Management, to The Official Google Blog, Making Ads More Interesting, http://googleblog.blogspot.com/2009/03/making-ads-more-interesting.html
12 Id.; see also Posting of Philip Lenssen to Google Blogoscoped, Google-Internal Data Restrictions, http://blogoscoped.com/archive/2007-06-27-n27.html
13 Jeff Howe, The Rise of Crowdsourcing, Wired Magazine, Issue 14.06 (June 2006), available at http://sistemas-humano-computacionais.wdfiles.com/local--files/capitulo%3Aredes-sociais/Howe_The_Rise_of_Crowdsourcing.pdf

These uses of data anonymizing have largely influenced the way PII is viewed and defined. More specifically, PII has been approached in a regulatory sense strictly from a categorical perspective.15 This is largely because the practice of data anonymizing is centered on categories of information that are seen as identifying a person and/or relating to highly sensitive information about a person. This categorical approach has been mainly influenced by what legal scholar Paul Ohm highlights as the three main types of data anonymizing: suppression, generalization, and aggregation.16

Data anonymizing can be done in many forms and tracked by the information supplier in varying degrees. As Paul Ohm points out, the vast majority of information is anonymized and then forgotten, in a privacy sense. Thus, the main types of data anonymizing all stem from the same category of the "release-and-forget" approach.17 The "forget" element of this approach refers to the fact that what happens to the information once it is released (i.e., who it is shared with, how it is treated, where it is shared, etc.) is never tracked.

In a deeper look at these three approaches to data anonymizing, it becomes very clear how PII has been defined only by categories of information. For example, suppression is the approach that perhaps most people would assume anonymizing takes. In suppression, the PII is simply redacted or removed, leaving all other information intact.18 Thus, a company examining the consumer habits of its online shoppers would redact consumers' names, street/mailing addresses, and credit card numbers while leaving the item purchased, any product rating made, how many items were purchased, when the purchase was made, what products were viewed, the gender and age of the consumer, and even the zip code to track habits by geographic region. This approach is premised on the idea that identifying information is singled out and removed.19 The suppression method does have inherent concerns because, while identifying information is removed, many other pieces of information remain that could lead to the re-identification of a person.20 Thus, in some cases suppression is not an adequate approach.

14 National Institutes of Health, HIPAA Privacy Rules for Researchers, http://privacyruleandresearch.nih.gov/faq.asp
15 Security of Our Secrets, at 182.
16 Broken Promises, at 1711-16
17 Id.
18 Id.
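To make the mechanics concrete, a minimal sketch of suppression in Python might look like the following; the record fields and the identifier list here are hypothetical, chosen to mirror the online-store example above rather than any actual system:

# Direct identifiers to redact; everything else is released untouched.
# These field names are illustrative assumptions, not a real schema.
DIRECT_IDENTIFIERS = {"name", "street_address", "credit_card_number"}

def suppress(record: dict) -> dict:
    """Return a copy of the record with the direct identifiers removed."""
    return {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}

order = {
    "name": "Jane Doe",
    "street_address": "123 Elm St",
    "credit_card_number": "4111-1111-1111-1111",
    "item": "noise-cancelling headphones",
    "rating": 4,
    "age": 34,
    "gender": "F",
    "zip_code": "90210",
}

print(suppress(order))
# {'item': 'noise-cancelling headphones', 'rating': 4, 'age': 34,
#  'gender': 'F', 'zip_code': '90210'}

Note that the released record still carries age, gender, and zip code, the very combination that, as the next paragraph explains, can narrow a record down to a handful of likely shoppers.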

To combat this, another approach, generalization, can aid in providing more anonymity while arguably protecting the utility of the information.21 For example, in the consumer scenario above, the zip code, gender, and age of a consumer could very easily narrow down the specific or likely shoppers within a particular geographic region. Thus, the online store may want to generalize the information, for instance by creating an age range rather than recording specific ages for each purchase, or by combining several zip codes to broaden the geographic scope. This approach lowers the risk of re-identification but also lowers the quality and specificity of the findings that can be made from the information.22
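The same scenario sketched in code, again with hypothetical fields: generalization coarsens exact values into buckets so that more people share each released combination:

def generalize(record: dict) -> dict:
    """Coarsen quasi-identifiers: bucket age into a ten-year range and
    truncate the five-digit zip code to its three-digit prefix."""
    out = dict(record)
    low = (out["age"] // 10) * 10
    out["age"] = f"{low}-{low + 9}"               # 34 -> "30-39"
    out["zip_code"] = out["zip_code"][:3] + "**"  # "90210" -> "902**"
    return out

print(generalize({"age": 34, "gender": "F", "zip_code": "90210", "rating": 4}))
# {'age': '30-39', 'gender': 'F', 'zip_code': '902**', 'rating': 4}

This bucketing is the basic move behind the k-anonymity model cited in note 4: keep generalizing until every released combination of quasi-identifiers is shared by at least k records.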

Finally, a third approach can be used to help circumvent either pitfall of suppression or generalization, but it also comes at a price. Aggregation is the release of specific information that typically has already been analyzed on some level.23 For example, if the online store wanted to analyze the gender of purchasers of a particular product, the store could supply the analysts with just two numbers: the amount of men and the amount of women that purchased the product. This approach shields the raw data from others, creating more privacy protection while still providing specific information. However, the information may be too specific, and thus this approach may only work when examining specific criteria rather than taking a holistic approach to information analysis.

19 Broken Promises, at 1713; see also Claudio Bettini et al., The Role of Quasi-Identifiers in k-Anonymity Revisited (DICo Univ. Milan Tech. Rep. RT-11-06, July 2006), available at http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.96.564&rep=rep1&type=pdf
20 Id. at 1714; see also Sweeney, supra note 4
21 Id.
22 Id.
23 Id. at 1715
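In code, aggregation means that only summary statistics ever leave the raw dataset; a minimal sketch, with made-up purchase records:

from collections import Counter

# Hypothetical row-level purchase records; these never leave the company.
purchases = [
    {"gender": "F", "product": "headphones"},
    {"gender": "M", "product": "headphones"},
    {"gender": "F", "product": "headphones"},
    {"gender": "M", "product": "blender"},
]

# Only the aggregate counts are released to the analysts.
counts = Counter(p["gender"] for p in purchases if p["product"] == "headphones")
print(dict(counts))  # {'F': 2, 'M': 1}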

The unifying theme running through all of these approaches is the fact that data is anonymized by virtue of categories. In all three approaches, the information is assessed for what types of information are too revealing in terms of identity and/or sensitivity. In no way do these approaches account for the interplay of information, or the interaction with openly available information, and how it may re-identify a person.24 Thus, we see that PII is, and always has been, viewed in a categorical sense, which overlooks the current issue of harmless information providing connections to the information that stands protected on either end of the privacy spectrum, namely identity and sensitive information. As Part II will further explain, this definitional gap in what counts as PII creates issues that are problematic under a legal scheme that has yet to anticipate new harms and an appropriate way to address them.

II. PII and the Definitional Gap: Netflix, a Case Study

The definitional gap of PII can be slightly difficult to understand in the abstract, because ratings of online purchases or viewed movies seem rather disconnected from your Social Security number or medical history. This is typically compounded by the fact that knowing a person's shopping or movie preferences seems rather harmless in comparison to knowing their Social Security number.25 Thus, it seems useful to highlight a case study that demonstrates the slippery slope that technology has placed privacy rights on, and how ill-equipped our regulatory scheme currently is to handle such potential harm, given its categorical approach to defining PII.26

24 Id. at 1716
25 Nancy Messieh, Do We Really Care About Our Online Privacy?, The Next Web, September 13, 2011, available at http://thenextweb.com/insider/2011/09/13/do-we-really-care-about-our-online-privacy/

In October of 2006, Netflix announced a contest with a winning prize of $1 million.27 The prize would go to the best new mathematical algorithm designed to help Netflix's computer servers better predict what movies to recommend to customers based upon the customers' viewing histories.28 In order to facilitate this contest, Netflix released over one hundred million records containing the personal viewing history of nearly five hundred thousand customers.29 The sharing of the information was largely motivated by potential financial gains; more specifically, if Netflix could ascertain a better way to predict movies that its customers would actually enjoy, Netflix stood to fortify its repeat-customer rates, resulting in a larger profit margin.30 In order to protect the privacy of its customers, Netflix utilized suppression tactics, replacing all customer names with non-identifiable ID numbers and redacting all other traditionally defined PII, namely credit card numbers, addresses, etc.31

The public release of information was seen as a great opportunity for researchers in the mathematics and computer science fields, not just for potentially winning the prize, but for the free access to a large amount of data that could be used to test theories and the latest advances in technology.32 The research that came from the Netflix release would prove more startling, however, as the implications of these advancements would force many privacy advocates to reexamine what actually constitutes PII and whether a categorical approach is the best policy approach for a regulatory scheme.33

26 Broken Promises, at 1731-45
27 Id. at 1720-22; see also The Netflix Prize Rules, http://www.netflixprize.com/rules
28 Id.; see also Netflix Prize: FAQ, http://www.netflixprize.com/faq
29 Id.; see also supra note 27
30 Id.
31 Id.
32 Id.

Among the research that was produced, a study showing the ability to link anonymized data back to the identity of the actual viewer was one of the more surprising findings.34 The study originally began by showing that if a Netflix customer was included within the data released for the contest, a person knowing very little about that customer would be able to identify which record belonged to that customer's viewing history and ultimately connect an identity with a specific set of anonymized data.35 Of course, this initially does not seem like the death knell of privacy via anonymized data, in that a customer's medical history was not subsequently exposed. However, the findings sparked further research that showed a deeper undercurrent to what potentially could be connected regarding a person's identity or sensitive information once this connection was made.36

For example, researchers took the Netflix data and used mathematical algorithms to cross-reference the viewing histories released by Netflix with the rating histories of individual users on IMDb.com. IMDb.com holds all of its information publicly, so users have opaque, anonymous ID names or numbers and can rate any movie within the database. The interesting element of this research was the fact that Netflix and IMDb.com did not contain identical lists of movies; each contained information about a viewer's movie preferences that the other did not. The research showed that, from a small subset of IMDb.com users, several were statistical matches for a customer's viewing history found in the Netflix release.37 What was more telling was the complete picture that the two databases, when combined, painted of a viewer's political persuasion and social viewpoints, based upon the ratings of several movies and TV shows.

33 Arvind Narayanan & Vitaly Shmatikov, How to Break the Anonymity of the Netflix Prize Dataset, arXiv, Feb. 5, 2008, at 1, http://arxiv.org/pdf/cs/0610105v2.pdf
34 Id.
35 Id.; see also Justin Vastola et al., Statistics for Re-Identification in Network Models, University of Pennsylvania, available at https://opimweb.wharton.upenn.edu/linkservid/1BAB25B2-D765-78AD-322BC102A698C73A/showMeta/0/
36 Id.

Again, this has yet to connect viewing history with medical history or other sensitive information, but it is the very activity that could. As we reconsider the spectrum of privacy, the connections made between all the pieces of information that fall in the middle of the spectrum lead to an eventual connection between the two ends themselves.38

For example, a person's viewing history may be linked to their social media account profiles that show the same sets of movies being liked or followed. Through the use of mathematics and computer science, a person can then compute the statistical likelihood that more than one person would have liked the same set of movies, eventually leading to a process of elimination; connecting that social media account's identity to a similar viewing history can thus re-identify someone via seemingly anonymized data.39
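The flavor of this matching can be sketched in a few lines of Python. Everything in the following toy example is invented (the IDs, usernames, and titles), and the scoring is only a simplified illustration in the spirit of the Narayanan and Shmatikov technique cited in note 33, not their actual algorithm:

import math
from collections import Counter

# Toy stand-ins for the two datasets: anonymized records keyed by opaque
# IDs, and public profiles keyed by username. All values are hypothetical.
anonymized = {
    "id_00417": {"Brazil", "The Fog of War", "Fahrenheit 9/11"},
    "id_01922": {"Shrek", "Finding Nemo", "Toy Story"},
}
public_profiles = {
    "moviebuff_jane": {"Brazil", "The Fog of War", "Fahrenheit 9/11", "Amelie"},
    "casual_sam": {"Shrek", "Titanic"},
}

# Count how often each title appears overall, across both datasets.
title_counts = Counter(
    t
    for ratings in list(anonymized.values()) + list(public_profiles.values())
    for t in ratings
)

def match_score(record: set, profile: set) -> float:
    # Weight each shared title by the inverse log of its overall popularity:
    # a rare title shared by both sides is far stronger evidence of a match
    # than a blockbuster everyone has rated.
    return sum(1.0 / math.log(1 + title_counts[t]) for t in record & profile)

for rec_id, titles in anonymized.items():
    best = max(public_profiles, key=lambda u: match_score(titles, public_profiles[u]))
    print(rec_id, "->", best)
# id_00417 -> moviebuff_jane
# id_01922 -> casual_sam

With enough overlapping rare titles, the best-scoring profile is overwhelmingly likely to belong to the same person, which is exactly the process of elimination described above.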

Subsequently, once an identity is found and connected to a particular social media account, the person creating the connections has gained access to a new cache of information about the individual that can be cross-referenced with other databases, and more connections can potentially be made regarding that person's identity and sensitive information.40 This process continues to repeat until a solid connection between actual identity and sensitive information is formed. The advances in technology have not only made this process possible but have also made seemingly random anonymous data identifiable and connectable at increasingly faster speeds.41

37 Id.
38 Broken Promises, supra note 7
39 Narayanan, supra note 33
40 Broken Promises, supra note 7

The overarching implication of these findings is detrimental to the categorical approach to PII. In particular, PII is defined by classifying categories of information as personal versus non-personal. Thus, the regulatory scheme enforces privacy protection by requiring that information classified as personal be protected and safeguarded, while non-personal information is not.42 However, as demonstrated above in the Netflix release, all information has the potential to become personal information. Thus, maintaining a categorical approach would require a ban on any information release or sharing, running afoul of several laws, including the Constitution.

To combat this issue, some states' courts have begun to expand the definition of PII to include other pieces of information traditionally not viewed as PII.43 However, this still maintains that there are some pieces of information that could not connect a person's identity to or with their sensitive information. As technology continues to advance, this will become less and less true, leaving gaps in the current regulatory scheme as to how to deal with such potential issues. Part III will examine a specific recommendation to deal with the current approach of categorically defining PII.

III. Paul Ohm and the Five Factor Test

This definitional gap in PII, created by connecting seemingly unrelated pieces of information and tying them back to the identity and/or sensitive information of a person, is problematic for two reasons. First, it forces PII's definition to be expanded to virtually all pieces of information, leaving room for sweeping regulation that could greatly harm the dissemination of information. Second, this issue is taking place under a regulatory scheme that focuses solely on the information itself.44 It does not consider the collector/sharer or the receiver of the information and what they are or are not doing to and with the information they obtain.

41 Arvind Narayanan et al., Link Prediction by De-Anonymization: How We Won the Kaggle Network Challenge, arXiv:1102.4374v1, Feb. 22, 2011, available at http://arxiv.org/pdf/1102.4374v1.pdf
42 Broken Promises, supra note 7, at 1731-44
43 Robert E. Braun & Craig A. Levine, Client Alert: California Supreme Court Rules That Zip Codes Are Personally Identifiable Information, JMBM (Feb. 16, 2011), available at http://www.jmbm.com/docs/california_supreme_court_rules_that_zip_codes_are_personal_identification_information.pdf

In reconsidering the definitional gap of PII and these two subsequent issues, Paul Ohm has created an approach he suggests would fill this gap and create a better "post-anonymization" regulatory scheme that rethinks the traditional PII approach.45 More specifically, Paul Ohm argues that a system with a comprehensive privacy baseline, supplemented by a more contextualized, sector-specific regulatory arm providing additional above-the-baseline protection to information, will better fit the privacy reality of today and the future.46 This two-tiered approach is further supplemented by a Five Factor Test that regulators are to apply when considering the second tier of sector-specific regulation; these factors are: 1) Data Handling Techniques, 2) Private versus Public Release, 3) Quantity, 4) Motive, and 5) Trust.47

Before briefly examining the details of the Five Factor Test, it is important to first discuss how this approach changes the current categorical approach to defining PII. The new approach does not require that categories become a thing of the past. While unrelated information may now have the potential to be connected, thus swallowing all categories of information into the operative definition of PII, those categories can still serve as a starting point that helps craft a more holistic view of the situation. The idea is essentially that who has the information, what they are doing with it, and how it is being handled may be approached categorically, based on how serious the information is viewed to be in terms of privacy.48 It is from this platform that the Five Factor Test launches the major considerations, categories included, that regulators should weigh when rethinking PII and privacy law in general.

44 Broken Promises, supra note 7, at 1731
45 Id. at 1751
46 Id. at 1762-64
47 Id. at 1764-68

The Five Factor Test begins with Data Handling Techniques. More specifically, Paul Ohm suggests that regulators should consider how categories of information are treated in terms of privacy and how susceptible a given handling technique is to fostering re-identification.49 The resultant suggestion is to create a rubric that rates handling techniques on some type of scale, which would then allow regulators to set varying standards specifically tailored to particular industries and sectors.50

The second and third factors are Private versus Public Release and Quantity. Private versus Public Release is just what it sounds like, and is directed at the very practice utilized by Netflix in its 2006 public release of information.51 The idea behind this factor is that public releases serve very little, if any, true utilitarian purpose; the driving force is often monetary. Therefore, regulators should only consider allowing public releases of anonymized data for exceptional purposes.52 Although not ideal from an information-utility standpoint, it is privacy at the cost of some utility.

The third factor, Quantity, takes a similar tack of making practices that are common today more exceptional in the future. Here, Paul Ohm argues for regulation that takes into account the amount of information one database is allowed to hold, noting that one-stop shopping for information greatly heightens the likelihood of re-identification.53 Moreover, the longer information is retained, the greater this risk becomes. Thus, Paul Ohm suggests that regulators regulate both the amount of information kept and the duration for which it is kept.54

48 Id. at 1759
49 Id. at 1764-68
50 Id.
51 Id.
52 Id.
53 Id.

Finally, the fourth and fifth factors are Motive and Trust. Both factors lend themselves to more of a philosophical ideal that the traditional PII definition and regulatory scheme have not possessed.55 In particular, Motive suggests that the reason for sharing should be a consideration in how relaxed or strict the regulation of a particular category of information should be. This allows flexibility to regulate purer pursuits, such as academic research, less harshly than monetarily motivated ones. Trust, much like Motive, considers the adversarial potential of any given actor; regulation should therefore vary in degree based upon the level of trust we have in the particular people and institutions that have or will receive information.56

The two-tiered approach and its supplementary Five Factor Test serve as a more comprehensive approach than the otherwise under- or over-inclusive traditional PII definition and regulatory scheme. However, as Part IV will highlight, this approach comes with its own shortcomings and needs additional tweaks in order to more fully fill the definitional gap while also preserving a balance for the utility of information.

IV. The Five Factor Test: Analysis and Considerations

The two-tiered approach, with its supplementary Five Factor Test, is premised on a fundamental principle that many can agree upon: the traditional definition of PII is rooted in the idea that information itself can be categorized as either potentially harming one's privacy or being irrelevant to it.57 As the Netflix release of 2006 demonstrates, this approach is insufficient, and thus the way PII has been defined is ineffective. The new system suggested by Paul Ohm correctly shifts the definition of PII away from the information itself and toward how information is treated and by whom it is collected and shared.58 The definitional gap of PII is thereby filled by allowing any information to be PII if it is treated in a certain way or shared with those who may use it to invade one's privacy, not simply by what kind of information it is generally. However, this definitional shift requires a new approach to the regulatory scheme that relies upon the traditional PII definition, and while Paul Ohm has crafted a thoughtful approach, it has several shortcomings that need to be considered and remedied before moving forward in recalibrating the regulatory scheme as it currently stands.

54 Id.
55 Id.
56 Id.
57 Id. at 1742-43

First, the two-tiered system with the supplemental Five Factor Test relies heavily upon industry regulation.59 Moreover, this new system is reliant upon regulation that is tailored to specific industries in different ways. This is problematic for two reasons. One, this type of approach is susceptible to industry discrimination. Although this is not a likely legal hurdle, it will remain a political one. The new system will likely lead to many industry-hired lobbyists seeking legislation to ease their industry-specific regulations for any number of reasons. This is not a new political reality, and it makes the new approach susceptible to efficiency arguments as well as free-market criticisms.

Two, in the same vein of political reality, it serves to note that theoretical approaches to reform are never well served when examined in a vacuum. Considering the influence of special interests in U.S. politics, this approach may have a strange result when attempted in such a political atmosphere. There is the potential that industries could lobby to receive lower standards than they should be given under the theoretical approach, thus circumventing the overall purpose of the reform. Again, this is highly circumstantial, but it could be the result when the categorical approach shifts from defining information types to defining industry types.60

58 Id. at 1764-68
59 Id.

To combat these two potential issues, we must step back to the theoretical approach proposed and redesign the definitional shift of PII. Although a categorical approach to information has its gaps, it should not be abandoned in the sense that industry categorization replaces it as the main definitional determinant. Instead, the approach should factor in information's re-identification potential under a variety of circumstances to shape how PII is redefined. Thus, rather than targeting industries, regulators should target industry practices. This would create fewer feelings of being "singled out" and foster more "best practices" conversations. It could also, arguably, thwart some special interests by appearing to the public less industry-focused and more procedure-focused.

Second, the two-tiered/five-factor approach has left a key player out of consideration: the individual. As noted above, this approach is highly industry-centric. Indeed, it is true that the majority of privacy harms stem from industries and their treatment of information in ways that could make it become PII. However, the same can be said of the individual whose information is being collected. Throughout this paper, subtle references have been made to the fact that outside information is needed to aid those who are attempting to re-identify anonymized information.

This outside information often comes from an individual's social networking accounts and other public-oriented venues where vast amounts of information unique to that individual are put on display for some or all to see. The currently suggested reform approach does not account for this fact at all. It places all the responsibility upon industries and their practices. This is largely problematic because information may be safeguarded from transforming from non-PII into PII through all sorts of regulations and mandatory practices, but this could be undone by the simple actions of an individual who may unwittingly cause their own information to become PII.61

60 Id.

In order to combat this, two things should be added to the suggested reforms. First, a mechanism must be put into place whereby industries are not held liable or responsible in instances where seemingly harmless information becomes PII due to the actions of an individual. For example, in the case of the Netflix release, if an individual places all their Netflix ratings on their Facebook page and has no privacy settings activated, so that their page is open to the public, no industry standard or practice will be able to protect that individual without banning any release or sharing altogether. This runs into the original definitional problem of PII swallowing all information within its definition.

Thus, in this case, Netflix should not be held liable for the fact that the release caused the viewing history of this individual to become PII. Of course, this assumes the release in this example was done between two private parties or met the exceptional-purpose requirement under the Private versus Public Release factor. This paper will not address who, the courts or the legislature, is best positioned to determine the levels of liability based upon privacy settings and individual release of information. Suffice it to say, though, that individual responsibility must be factored into the new regulatory scheme.

Moreover, the second issue regarding individuals is a lack of knowledge. Currently, and potentially under the new system, individuals are not made aware of how their information is used and, subsequently, how their very own actions could undermine their own privacy. Thus, the regulatory scheme must provide some regulation regarding how industry practices interact with individuals' practices regarding their own information. For example, Facebook could present a concise and simple warning that must be clicked through before posting, reminding individuals that information shared without appropriate privacy settings could lead to re-identification of anonymized data and thus adversely affect the individual's privacy. It is important that the new regulatory system institute transparency and education for the individual wherever possible.

61 Id.

Finally, the proposed system assumes, or at least fails to address, another unlikely political reality: an educated legislature. Although this final issue is outside the scope of the PII definitional issue, it still serves as an appropriate inquiry, because who creates and institutes these regulations will effectively shape what the new definition of PII will be. Accurately determining which industries should be regulated, and at which levels, would require a fairly in-depth, if not specialized, level of knowledge. This knowledge would need to include at least an understanding of current information collecting and trading practices, a general understanding of the technology involved, and a sense of how potential advances could alter its capability. The regulators in Paul Ohm's approach would need an assumed level of understanding that is both adequate and appropriate to make informed decisions on the specifics of the regulations created.62 No suggestions are made for what level of knowledge is required to appropriately make these regulatory decisions or for how to ensure it exists.

To combat this shortcoming, perhaps an additional piece of this two-tiered/five-factor approach is needed: expand the duties of the FCC or create a new regulatory agency. In either case, the regulators would not be the legislators themselves, who may or may not have adequate knowledge to make these decisions, but an agency that can be staffed with those who do. It serves to note again that this paper does not address whether the federal government should be involved in this regulation process or whether the courts should be the ones to foster the paradigm shift; it simply considers the issue from a largely legislative approach. In the end, an agency staffed with people knowledgeable in the many facets of this issue would likely be a better regulator than Congress itself.

62 Id.

CONCLUSION

The concept of data anonymizing has served as a beacon of security and reassurance as we have emerged into a modern age filled with technological advances that can both enhance and harm our privacy expectations. However, as these advances continue forward, they have eroded the very security they once stood to protect. The ability to analyze information, cross-reference it, and re-identify whom it belongs to has given reason for pause, as our regulatory framework has not been designed to handle this new conundrum.

As we approach rethinking this system and its fundamental foundation in a categorical definition of PII, it is important to maintain that foundation and build from it. As this paper has shown, there are approaches that, with certain enhancements, can begin to form a system that fills the definitional gap of PII by shifting from informational categories to how information is used, based on the interplay of industry and behavioral components; components that can make seemingly irrelevant anonymized information re-identifiable and harmful to an individual's privacy expectations.