Does fair anonymization exist?

This article was downloaded by: [University of Bath] on 06 October 2014, at 12:16. Publisher: Routledge. Informa Ltd, registered in England and Wales, Registered Number: 1072954. Registered office: Mortimer House, 37-41 Mortimer Street, London W1T 3JH, UK.

International Review of Law, Computers & Technology. Publication details, including instructions for authors and subscription information: http://www.tandfonline.com/loi/cirl20

Does fair anonymization exist? Zoltán Alexin, Department of Software Engineering, University of Szeged, Szeged, Hungary. Published online: 02 Jan 2014.

To cite this article: Zoltán Alexin (2014) Does fair anonymization exist?, International Review of Law, Computers & Technology, 28:1, 21-44, DOI: 10.1080/13600869.2013.869909. To link to this article: http://dx.doi.org/10.1080/13600869.2013.869909

Taylor & Francis makes every effort to ensure the accuracy of all the information (the "Content") contained in the publications on our platform. However, Taylor & Francis, our agents, and our licensors make no representations or warranties whatsoever as to the accuracy, completeness, or suitability for any purpose of the Content. Any opinions and views expressed in this publication are the opinions and views of the authors, and are not the views of or endorsed by Taylor & Francis. The accuracy of the Content should not be relied upon and should be independently verified with primary sources of information. Taylor and Francis shall not be liable for any losses, actions, claims, proceedings, demands, costs, expenses, damages, and other liabilities whatsoever or howsoever caused arising directly or indirectly in connection with, in relation to or arising out of the use of the Content.

This article may be used for research, teaching, and private study purposes. Any substantial or systematic reproduction, redistribution, reselling, loan, sub-licensing, systematic supply, or distribution in any form to anyone is expressly forbidden. Terms &


Conditions of access and use can be found at http://www.tandfonline.com/page/terms-and-conditions


Does fair anonymization exist?

Zoltan Alexin∗

Department of Software Engineering, University of Szeged, Szeged, Hungary

(Received 9 August 2013; accepted 25 November 2013)

Anonymization is viewed as an instrument by which personal data can be rendered so that it can be processed further, for purposes that are beneficial to the public good, without harming data subjects' private lives. The anonymization is fair if the possibility of re-identification can be practically excluded and the data processor does all that he or she can to ensure this. For a fair anonymization, simply removing the primary personal identification data, such as the name, residential address, phone number and email address, is not enough, as many papers have warned. Therefore, new guidance documents, and even legal rulings such as the HIPAA Privacy Rule on de-identification, may improve the security of anonymization. Researchers are continuously testing the efficiency of the methods and simulating re-identification attacks. Since the US and Canada do not have a population registry, re-identification experiments there were carried out with the help of other publicly available databases, such as census data or voters' databases. Unfortunately, neither of these is complete and sufficiently detailed, so the computed risk was only an estimate. The author obtained zip code, gender and date of birth distribution data from the Hungarian population registry and computed re-identification risks in several simulated cases. This paper also gives an insight into the legal environment of Hungarian personal medical data protection legislation.

Keywords: medical databases; anonymization; risk of re-identification

1. Introduction

The word anonymization originates from the Greek word ἀνωνυμία (anonymia), which means 'without a name' or 'nameless'. The Oxford English Dictionary (OED) tells us that it was first used in 1972 by Sir Alan Marre, the UK's Parliamentary Ombudsman ('I now lay before Parliament . . . the full but anonymised texts of . . . reports on individual cases.'). According to the OED, the usage of the word is chiefly medical.1 Since current anonymization methods often produce databases that retain some risk of re-identification, it would be more correct to use the term de-identification instead (Sweeney 2000). However, anonymization is still widely used in the scientific literature when the aim of the data transformation is to minimize the re-identification risk.

When doing medical research, patients' privacy must be respected; respecting privacy is a legal and moral obligation. It follows that medical data should be anonymized before it is shared with researchers. In the ideal case, the anonymization procedure protects the privacy of the patients, as practically nobody can re-identify them. At the same

© 2014 Taylor & Francis

∗Email: [email protected]. This article was originally published with errors. This version has been corrected. Please see Erratum (doi: 10.1080/13600869.2014.884534)

International Review of Law, Computers & Technology, 2014, Vol. 28, No. 1, 21–44, http://dx.doi.org/10.1080/13600869.2013.869909


time, researchers can obtain the necessary data for discovering new treatment methods or evaluating the efficiency of existing ones.

Anonymization plays an important role when physicians discuss medical cases, for example at conferences or in scientific papers. While a sentence like 'Mr. John Smith died of AIDS in London on 3rd July 2012.' probably breaches the medical confidentiality requirement, its anonymized counterpart 'A man died of AIDS in London on 3rd July 2012.' does not, although even this warrants further scrutiny.

Since medical institutions began storing treatment data electronically, a large amount of information has accumulated. This mass of data could be a potential resource for medical research. A new kind of medical research has now evolved that can process this quantity and type of information, using new methods such as data mining, information extraction and knowledge discovery. This kind of data processing is not carried out by individual scientists, but by the computer centers of an international network of research institutes. Mathematicians, physicists, biologists, chemists, pharmacists and even linguists are potential participants in a medical research project. Can patients' privacy be preserved in such an environment?

Through anonymization, data can be removed from under the umbrella of data protection law and ethical regulations. Researchers have a strong interest in stating that they are processing anonymous data. They will then have a free hand when performing a research study; no one will ask them where the data were transferred, who got access to the data, how long they retained the data, or how the different databases were linked together.

The re-identification of databases that were previously said to be 'anonymous' has happened many times. This fact reminds us that fair data processing is obligatory under EU law.2 The Internet, where people post personal profiles, pictures, letters, blogs and tweets, provides a means for re-identification. There are several risky combinations of demographics, such as a zip code (or postcode) and date of birth, upon which a large portion of the population can be uniquely identified. Many countries are aware of this risk and have introduced precautionary measures intended to control the use of such combinations of identifying characteristics.

In Hungary, no such awareness can be observed. Medical research databases all contain a zip code, date of birth, gender and many other quasi-identifiers. Researchers can get access to the data without any further safeguards. Datasets are said to be anonymous, so the requirements of data protection and medical ethics are waived.

The author was able to obtain statistical distribution data from the Hungarian population registry and, with the help of these data, computed the re-identification risks in different settings, geographical locations and age groups. He calculated the re-identification risk values when data sets were generalized, e.g. when the year of birth was used instead of the date of birth. By modifying the current legislation, the re-identification risk, and hence the potential danger to data subjects' privacy, could be substantially reduced. The Hungarian legislative philosophy differs from the EU approach; this could be another justification for the need to amend current Hungarian regulation.

2. Related articles on anonymization

After several successful re-identification attempts in the US, it became apparent that proper anonymization requires stricter rules. With regard to the new federal law on medical data, HIPAA (the Health Insurance Portability and Accountability Act),3 the HHS (Department of Health and Human Services) developed the so-called HIPAA Privacy Rule in 2000 (Evans 2011).4 It was the first legal ruling that contained an explicit description of an


anonymization technique that could be encoded in a computer program. The Privacy Rule was included in the appendix of the HIPAA law. The rule is still periodically checked and refined according to new findings in the US Census data.5

2.1. The robust anonymization process

The anonymization procedure should not be excessively costly. It should be organized in such a way that the staff of the IT departments of the clinics, or even a physician, can oversee, understand, and perform it. At the same time, it should minimize the re-identification risk. The HIPAA Privacy Rule tries to meet this requirement. It provides two methods for rendering databases anonymous: the Expert Determination and the so-called Safe Harbor methods (see Office for Civil Rights, Guidance Regarding Methods for De-identification of Protected Health Information 2012). According to the HIPAA law, an expert is a person with appropriate knowledge of generally accepted statistical and scientific principles and a method for rendering information not individually identifiable.

The Safe Harbor method is an easy and robust way that simply removes 18 categories of data from the data sets and generalizes the zip codes and the dates. The method allows one to retain just the first three digits of the five-digit zip codes. All elements of dates (except the year) are to be deleted for dates that are directly related to an individual, including the birth date, admission date, discharge date, and date of death; in addition, all ages over 89, and all elements of dates (including the year) indicative of such an age, must be removed, except that such ages and elements may be aggregated into a single category of age 90 or older. The Safe Harbor method is clearly robust and understandable by personnel with a higher education degree.

Generalization means decreasing the accuracy of data items. In the case of dates, the removal of the day, or of the day and month, is viewed as generalization. For numeric data, a common generalization is banding, i.e. mapping values in a given interval to a representative value, e.g. mapping all body weights between 60 kg and 69.9 kg to 65 kg or to a category of 60–70 kg. Generalization decreases the re-identification risk. If two people were distinguishable by their dates of birth (e.g. 23/05/1992, 20/11/1992), after generalization to the year of birth (1992, 1992) they might not be. Simple generalization means generalizing just one data item, such as the date of birth to the year of birth, while compound generalization means the simultaneous generalization of two or more data items. If certain records can be distinguished from all the other records even after generalization, they can be suppressed, i.e. excluded in whole or in part from the final database.
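The generalizations described above are easy to express in code. Below is a minimal, hypothetical sketch of Safe Harbor-style transformations; the field names and record layout are invented for illustration and are not taken from the HIPAA text:

```python
from datetime import date

# Illustrative Safe Harbor-style generalizations (field names are
# hypothetical): zip codes keep their first three digits, dates are
# generalized to the year, and ages over 89 collapse into "90+".

def generalize_zip(zip_code: str) -> str:
    """Retain only the first three digits of a five-digit zip code."""
    return zip_code[:3] + "**"

def generalize_date(d: date) -> int:
    """Keep only the year of a date directly related to an individual."""
    return d.year

def generalize_age(age: int):
    """Aggregate all ages of 90 and over into a single '90+' category."""
    return "90+" if age >= 90 else age

record = {"zip": "94110", "birth": date(1992, 5, 23), "age": 93}
anonymized = {
    "zip": generalize_zip(record["zip"]),
    "birth_year": generalize_date(record["birth"]),
    "age": generalize_age(record["age"]),
}
print(anonymized)  # {'zip': '941**', 'birth_year': 1992, 'age': '90+'}
```

Each function discards precision rather than masking it, which is what makes the approach robust: the removed detail cannot be recovered from the released record.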

Paul Ohm (2010) presented three re-identification scandals that questioned the applicability of such a robust way of anonymization. The AOL Research database, produced in 2006, contained web search queries that summarized three months of activity from 650,000 users. The Netflix movie rental service publicly released 100 million records that revealed how nearly 500,000 users had rated movies between December 1999 and December 2005. In both cases, users were denoted by integer numbers, and no other demographics were included in the databases. Two reporters from the New York Times quickly identified an elderly woman in the AOL database based on her queries, and she later acknowledged that she had authored the queries. The vulnerability of the Netflix database was demonstrated by two researchers, Narayanan and Shmatikov from the University of Texas.6 They statistically proved that anyone who knows a little about an individual subscriber can easily identify the subscriber's records if they are present in the Netflix database. They actually identified two subscribers by obtaining external data from the IMDb website.7 They examined only 50 IMDb users, not wanting to breach the privacy policy of the website.


According to Paul Ohm, the era of robust anonymization is over, because the removal of predefined categories of data cannot prevent re-identification; this follows from examples like those above. Candidate research datasets must be cautiously and rigorously analyzed by expert statisticians before release.

The Information Commissioner for the UK published guidance on anonymization entitled 'Anonymization: managing data protection risk, code of practice' (see Information Commissioner's Office 2012). The document proposes a decision diagram that can help data administrators decide whether they are allowed to publish a database. If the law forbids releasing a database in personally identifiable form, then it needs to be anonymized beforehand. The document, in harmony with Paul Ohm's article, requires carefully estimating the re-identification risk before formally stating that datasets are anonymous.8

The amount of external data that is publicly available about individuals on the Internet is steadily growing, which has led to an increased re-identification risk. An adversary would most likely apply this external knowledge when attacking a database. Data administrators should be especially careful when releasing medical data and be more aware of the associated re-identification risk.

2.2. The release-and-forget approach

As the name suggests, this means that data controllers first anonymize the database, then release the records publicly, privately, or internally, and then forget about it all. They make no attempt to follow what happens to the dataset after its release. If datasets were not appropriately anonymized, they will carry a certain re-identification risk. Should this risk come to light, data users cannot be reached, and the data administrator is not in a position to apply any retrospective control on the use of the data. Datasets may already have been shared, sold, or included in some software product.

Khaled El Emam and his colleagues (2009) published, in the appendix of their paper, a list of possible privacy measures that the Canadian Institute for Health Information, the Canadian Institute for Health Research, the Canadian Organization for Advancement of Computers in Health and others advise one should impose on data users when a transferred dataset is not (or, in the interest of the research, cannot be) sufficiently anonymized. Datasets are transferred to the recipients only if they formally agree to abide by these additional rules, including, say, the nomination of a privacy officer, a written data-sharing agreement between collaborators, a written privacy policy, sanctions for breaching confidentiality rules, the deletion of data after a certain retention period, the physical security of rooms, devices, and buildings, and surprise inspections.

3. Medical ethics and data governance

The World Medical Association (WMA) was founded after the Second World War, on 17 September 1947, when physicians from 27 different victorious countries met in Paris. The organization was created to ensure the independence of physicians and to work for the highest possible standards of ethical behavior. Currently, it has 102 member countries. The WMA developed the Helsinki Declaration on Ethical Principles for Medical Research Involving Human Subjects, adopted by the General Assembly in June 1964.9 This document originated from the ten points of the Nuremberg Code,10 which required informed consent from potential research subjects involved in a medical experiment. The Helsinki Declaration introduced ethics committees, mandating them to approve research plans by judging their scientific qualities, the level of protection of research subjects, the qualifications


of researchers, and so on. The Helsinki Declaration is not a legal obligation. The Council of Europe's Oviedo Treaty was developed from the declaration.11

The Helsinki Declaration recognizes the self-determination of patients and contains guarantees to ensure it. Paragraph 22 states that participation in medical research is voluntary, and nobody can be enrolled until he or she freely agrees to it. Paragraph 23 says that every precaution should be taken to protect the privacy of research subjects and the confidentiality of their personal information. Anonymization of the data before sharing it among researchers could be one such precautionary measure. Paragraph 25 concerns the collection and/or analysis of identifiable medical samples (e.g. blood or tissue) or data. In this special case, consent can be waived by the ethics committee.12

Frequently, when researchers ask for an anonymized excerpt from a medical treatment database, they request a waiver for obtaining informed consent, and the ethics board usually gives it to them. In one recent public health study, IRBs (Institutional Review Boards) refused only seven of 153 (4.6%) of the requested medical records and insisted that patient consent was required (Cutrona et al. 2012). The author is a member of a regional medical REC (Research Ethics Committee) and his own personal experiences bear this out.

Barbara J. Evans (2011) presented some open questions concerning data ownership. Sensitive medical data are generally subject to the informational self-determination of patients. When a community decides that it will restrict patients' rights to informational self-determination so that researchers can obtain the necessary data, the approval process and the research study both have to be controlled by this community. This means complete openness: the work of an REC has to be transparent, like the approval process; the approved research plans should be publicized, and the research findings should benefit the community itself.

Fiona Caldicott was mandated by the UK Department of Health (DoH) in 2012 to chair a committee that reviewed health information policy. The general trust in medical confidentiality was substantially shaken when the government introduced the Connected for Health network and physicians began to upload sensitive data without obtaining consent from patients. If this were not enough, the Information Commissioner's Office publicized data breach cases and examples of heavy fines in the health sector. The committee summarized its opinion in Information: To share or not to share? (Caldicott 2013). It defended the current UK practice, where poorly anonymized central medical databases (with only the name and the home address removed) are created by the force of the law, but called for stronger control and transparency:

The linkage of personal confidential data, which requires a legal basis, or data that has been de-identified, but still carries a high risk that it could be re-identified with reasonable effort, from more than one organisation for any purpose other than direct care should only be done in specialist, well-governed, independently scrutinised and accredited environments called accredited safe havens.

The report mentioned that both Article 8 of the European Convention on Human Rights (ECHR) and the European Data Protection Directive 95/46/EC require reasonable objections to the disclosure of personal confidential data to be respected. The Health and Social Care Act 2012 (UK) could not be adequately protected from legal challenge if it were found to conflict with Article 8 of the ECHR.13

While the authors of the above publications had concerns about privacy protection in medical database research, two recent legal developments should also be mentioned. As a rule of thumb, small cell values (e.g. cells containing values less than or equal to 5)


are not to be communicated in statistical tables related to individuals. This is called small value suppression. In a test case, the UK High Court decided that small value suppression was needless because there was no reasonable basis to believe that the information could be used to identify an individual.14 The plaintiff, the ProLife Alliance, requested detailed abortion statistics from the UK DoH. The department gave it only a small-value-suppressed table of data, but the ProLife Alliance wanted the complete table, referring to the Freedom of Information Act. The Ombudsman for the UK, the Information Commissioner, and many legal experts expressed their opinions on the case.

The European Court of Human Rights (ECtHR) made a decision in the Gillberg vs. Sweden case No. 41723/06 in 2010.15 Gillberg was a researcher who studied adolescents suffering from behavioral disorders. He made video recordings of them and communicated his findings in scientific journals. He obtained informed consent from the subjects or from their parents and approval from the local ethics committee. Later, another research team approached him and asked for an anonymized copy of his research dataset (recordings) in order to check his hypotheses and reproduce his results. He argued that video recordings could not be properly anonymized, so he refused the request in order to protect the patients' privacy, but the Swedish court compelled him to share the data. After he destroyed all his data, the Swedish court imposed a fine and a suspended prison sentence. He appealed to the ECtHR, which ruled in favor of Sweden, holding that the reproducibility of medical research was a higher-order social interest. When data subjects give consent to the use of their sensitive data for medical research purposes, it is understood that a second research team may also get access to the data for the same purpose if they meet all the confidentiality requirements.

4. Preceding research on re-identification

Re-identification risk is defined as the probability of identification. In practice, it is the ratio of the number of identifiable persons to the total number of persons (1). For example, if the risk is 70%, then 70 out of 100 records can be linked to identifiable individuals on average; with 100,000 records and a 70% risk, potentially 70,000 people could be identified. The Canadian data protection commissioner viewed a 20% risk as acceptable with additional privacy measures (see El Emam et al. 2009). Mathematicians do not share this opinion and are inclined to accept a risk of 0.1% or less.

risk = number of identifiable people / number of all people    (1)
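Formula (1) is simple enough to state as a one-line function; the sketch below merely reproduces the arithmetic of the example above:

```python
# Formula (1) as code: re-identification risk is the ratio of
# identifiable records to all records.

def reidentification_risk(identifiable: int, total: int) -> float:
    return identifiable / total

# 70 of every 100 records linkable -> 70% risk...
print(reidentification_risk(70, 100))  # 0.7

# ...so 100,000 records at that risk expose about 70,000 people.
print(round(100_000 * reidentification_risk(70, 100)))  # 70000
```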

Perhaps L. Sweeney was the first researcher (Sweeney 2000) to calculate the re-identification risks of different anonymization methods. She processed the US 1990 Census database that was made freely available on the web.16 Three tables (five-digit zip code, place,17 county18) contained the number of inhabitants residing in the given geographical location by age groups (under 12 years, 12–18 years, 19–24 years, 25–34 years, 35–44 years, 45–54 years, 55–64 years, and above 65 years). The database did not contain the exact date of birth, but Sweeney was still able to compute a measure of the risk. She found the number of possible different values for gender and date of birth in an age group (Q) and compared it with the number of inhabitants. If the population size was less than or equal to Q, then she presumed that all the inhabitants could be uniquely identified; otherwise


nobody could be uniquely identified. This is a very rough estimate, because statistically there could be uniquely identifiable and unidentifiable individuals in both cases. In conclusion, she reported that 87.1% of US citizens could be uniquely identified by their five-digit zip code, gender and date of birth.19

Kathleen Benitez and Bradley Malin (2010) simulated re-identification attacks against medical databases. They used the United States 2000 Census database and several voters' databases. US citizens have to register at the local authorities if they want to participate in elections. Most states publicize the voters' list online, while others distribute it upon payment. Some databases contain the name, address and the exact date of birth, while others only have the name, address and the year of birth. The authors applied a statistical method proposed by Dr. Golle (2006) that assumes members of a group are distributed uniformly at random in a larger group: an individual is as likely to be born on 5 February as on 6 February. For an aggregated group with n individuals who might correspond to b possible subgroups, or 'bins', the expected number of bins containing i individuals can be found using the formula:

f_n(i) = C(n, i) · b^(1−n) · (b − 1)^(n−i)    (2)

Here, C(n, i) is the so-called binomial coefficient,20 which denotes the number of different ways i distinguishable individuals can be selected from n individuals:

C(n, 1) = n / 1 = n
C(n, 2) = n(n − 1) / (1 · 2)
C(n, k) = n(n − 1) · . . . · (n − k + 1) / (1 · 2 · . . . · k)

When using formula (2), b (the number of bins) corresponds to the number of days within a year, so we can estimate the number of groups (bins) of persons sharing a common date of birth. Given an integer i (1, 2, 3, . . .), we can compute the number of different days (bins) on which exactly i persons share a date of birth. This calculation requires only the population size in a given year (n) and the length of the year in days (365 or 366).

Benitez and Malin proposed the concept of g-distinctness.

g-distinct

An individual is said to be unique when he or she has a combination of characteristics that no one else has, and we say that an individual is g-distinct if their combination of characteristics is identical to that of g − 1 (or fewer) other people. Uniqueness is the base case of 1-distinctness. In general, the number of g-distinct individuals is the sum of the numbers of people in bins with i individuals for all 1 ≤ i ≤ g.

In the case of 70 persons born in the same year, we expect that 70 · 365^(−69) · 364^(69) ≈ 57.93 individuals will have a unique birth date and that (70 · 69 / 2) · 365^(−69) · 364^(68) ≈ 5.49 bins will contain two persons sharing a common date of birth on average. Continuing this calculation, with 70 persons there are 57.93 + 2 · 5.49 = 68.91 persons on average – or 98.44% of the population – who are 2-distinct. In a zip code region having 8–10,000 inhabitants, there might be roughly 4–5000 males and 4–5000 females, 5000/70 ≈ 70 of them having the same year of birth.
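The figures above can be reproduced with a short script implementing formula (2). This is a minimal sketch under Golle's assumptions (a 365-day year, birthdays uniformly distributed), not the authors' original code:

```python
from math import comb

# Golle's bin-counting formula (2): the expected number of "bins"
# (days of the year) containing exactly i of n individuals, assuming
# birthdays are uniformly distributed over b days.

def expected_bins(n: int, i: int, b: int = 365) -> float:
    return comb(n, i) * b ** (1 - n) * (b - 1) ** (n - i)

n = 70  # persons born in the same year
unique = expected_bins(n, 1)       # singleton bins: unique birth dates
pairs = expected_bins(n, 2)        # bins holding exactly two persons
two_distinct = unique + 2 * pairs  # persons who are 2-distinct
print(round(unique, 2), round(pairs, 2), round(two_distinct, 2))
# 57.93 5.49 68.91
```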


Benitez and Malin (2010) compared the Limited Dataset policy and the Safe Harbor policy. The Limited Dataset policy means that researchers may get more complete medical information (which includes the exact date of birth), while the Safe Harbor policy means that the data were transformed according to the HIPAA Privacy Rule. In the case of the Limited Dataset policy, they found that 18.7% of the population was 1-distinct and 59.7% of the population was 5-distinct, which means a substantial risk of re-identification. In contrast, under the Safe Harbor policy, 0.0003% of the population was 1-distinct and 0.002% was 5-distinct.

In l-diversity: Privacy beyond k-anonymity (see Machanavajjhala et al. 2007), the authors argued that although the k-anonymity criterion provides strong privacy protection, it is not enough.

k-anonymous

A data table is k-anonymous for a certain integer k if it contains no (k − 1)-distinct records. This means that for any combination of identifying characteristics (so-called quasi-identifiers),21 there are at least k records sharing this combination of quasi-identifiers.

If a database is 5-anonymous, then for an arbitrary combination of quasi-identifiers there exist either at least five records in the database that match this query or none. An attempt to link this database to another containing names and demographic data can therefore succeed with a probability of at most 1/5. Consequently, the re-identification risk for this database is at most 20%.
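The criterion is straightforward to check mechanically. The sketch below is a minimal, illustrative test of k-anonymity on an invented table; the quasi-identifier names and values are hypothetical:

```python
from collections import Counter

# A table is k-anonymous if every combination of quasi-identifier
# values is shared by at least k records.

def is_k_anonymous(records, quasi_identifiers, k):
    counts = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return all(c >= k for c in counts.values())

table = [
    {"zip": "671", "birth_year": 1992, "gender": "M", "diagnosis": "flu"},
    {"zip": "671", "birth_year": 1992, "gender": "M", "diagnosis": "asthma"},
    {"zip": "672", "birth_year": 1985, "gender": "F", "diagnosis": "flu"},
    {"zip": "672", "birth_year": 1985, "gender": "F", "diagnosis": "flu"},
]
qi = ["zip", "birth_year", "gender"]
print(is_k_anonymous(table, qi, 2))  # True
print(is_k_anonymous(table, qi, 3))  # False
```

Note that in the second group both records carry the same diagnosis, so an adversary learns the sensitive value without re-identifying anyone; this is exactly the weakness that l-diversity addresses.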

When a dataset is k-anonymous and, for a certain combination of quasi-identifiers, the sensitive data stored in the database is identical for all records, k-anonymity is worthless. An adversary may discover that the same sensitive data concerns all potential individuals and may learn what it is. Machanavajjhala et al. proposed a metric that takes into account the diversity of the sensitive data values beyond k-anonymity; hence we expect it to provide greater protection against re-identification.
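The weakness just described can be tested with a distinct-l-diversity check. This sketch illustrates only the simplest variant of the metric of Machanavajjhala et al. (they also define entropy and recursive variants); the toy table and the helper name are hypothetical:

```python
from collections import defaultdict

def is_l_diverse(records, quasi_identifiers, sensitive, l):
    """Distinct l-diversity: every quasi-identifier group must contain
    at least l different values of the sensitive attribute."""
    groups = defaultdict(set)
    for r in records:
        groups[tuple(r[q] for q in quasi_identifiers)].add(r[sensitive])
    return all(len(values) >= l for values in groups.values())

# A 2-anonymous group whose sensitive value is identical is not 2-diverse:
table = [
    {"zip": "67xx", "yob": 1975, "sex": "F", "dx": "asthma"},
    {"zip": "67xx", "yob": 1975, "sex": "F", "dx": "asthma"},
]
print(is_l_diverse(table, ["zip", "yob", "sex"], "dx", 2))  # False
```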

Pierangela Samarati (2001) introduced a computer software system called µ-argus that consecutively applies two transformation steps (generalization or suppression) and evaluation on a given database table and iteratively produces an anonymized table. Khaled El Emam and his colleagues (El Emam et al. 2009) presented results obtained from applying the improved µ-argus system. The improvement guaranteed that the software would always find the globally optimal solution. The researchers anonymized a data table of 94,100 prescription records of CHEO (Children's Hospital of Eastern Ontario), representing 10,364 patient visits and 6970 unique patients. The authors cautiously analyzed the possible generalizations and applied as many as possible (e.g. the admission date was generalized to quarter). 14.9% of the records had 1–4 quasi-identifiers22 suppressed. The re-identification risk for 95% of the records was below 33%; 80% of the records had a risk of less than 10%.

Peter Kwok and colleagues (2011) presented a simulated re-identification attack against an anonymized medical database (15,000 records) using a commercial database purchased from a marketing research firm called InfoUSA (30,000 records). The medical database was anonymized following the HIPAA Privacy Rule. After combining the two tables they found 22 matches, but it was later shown that re-identification was successful in only two cases. The re-identification risk in this case was only 2/15,000 = 0.013%.

5. The Hungarian zip code system

The basic unit of Hungarian public administration is the county. The first, so-called royal counties were established during the reign of Stephen I of Hungary (AD 969–1038). After the First World War, Hungary lost two thirds of its territory and the number of

28 Z. Alexin


counties fell from 63 to 25. Many counties were cut in half by the new state border. The currently existing structure of counties was created in 1950, when the smaller counties were merged and new seats were designated. Now Hungary has 19 counties and the capital. See Figure 1.

The precise borders of the counties may vary over time. Previously, inhabitants of settlements wanted to belong to Pest County, hoping that the central region would receive greater governmental attention. Now the EU provides development aid only for disadvantaged regions, so inhabitants of settlements want to belong to the poorer regions. For example, Nagykoros (a city in Pest County) recently held an unsuccessful referendum about becoming part of the neighboring Bacs-Kiskun County.23

Hungary has around 3150 settlements and 13,000 localities. The capital of Hungary, Budapest, now has 23 districts. There are many street names, like Petofi, Rakoczi, Vorosmarty, Batthyany and Kossuth, that can be found in practically every district. Many locality names, like Allami Gazdasag, Erdeszhaz and Facanos, can be found in several distant parts of Hungary. Thus, if the address on a letter contains only the number, the street, and the city/village name, it may not necessarily determine the exact geographic location.

Figure 1. Hungarian counties and the capital. The map was obtained from the Hungarian Central Statistical Office homepage http://www.ksh.hu/regional_atlas_administration_structure (retrieved 12 October 2013).

Table 1. The first digit of zip codes and the corresponding geographic locations.

Zip code Geographic location

1xxx    Budapest (capital)
2xxx    Central part around Budapest
3xxx    Northern part
4xxx    Northern part of the Great Plain
5xxx    Middle part of the Great Plain
6xxx    Southern part of the Great Plain
7xxx    Southern part of Transdanubia
8xxx    Central part of Transdanubia
9xxx    Western part of Transdanubia


The zip code system was introduced in 1972 by the Hungarian Post to speed up deliveries. Since then, every address has to include a zip code that disambiguates the address, i.e. specifies the exact geographic location of the addressee. Every Hungarian zip code has four digits. Table 1 shows the correspondence between the first digit of the code and the geographic location. A detailed map of two-digit zip code areas is shown in Figure 2.

6. The itemized medical database (IMD)

The so-called Itemized Medical Database (in Hungarian TEA, Teteles Egeszsegugyi Adattar24) was created from three accounting databases of the National Health Insurance Fund25 by Decree No. 76 of 2004, issued by the Ministry of Health, Social and Family Affairs on 19 August 2004. These databases were the prescription database, the outpatient database and the inpatient care events database. The prescription database at that time contained all dispensed prescription-only medicines purchased by Hungarian patients, regardless of whether they were subsidized or not.26

In each database, the social security identifiers (in Hungarian, TAJ, tarsadalombiztositasi azonosito jel) were replaced by a pseudonym. The TAJ is a nine-digit identifier that was replaced by a pseudo-TAJ, which is also a nine-digit number. In fact, the ninth digit is a control number computed from the preceding eight digits.27 In this way, 100 million valid TAJ identifiers can be generated. The number 000,000,000 is a valid TAJ identifier.

The National Health Insurance Fund has been collecting accounting data on care events electronically since 1 January 1998. It preserves the tabular correspondence between the TAJs and pseudo-TAJs as well. In 2004, the contents of the three existing databases were pseudonymized and identical copies of the datasets were sent out to four hosts. From that time, the

Figure 2. First two digits of postcodes in Hungary. This map was shared on Wikipedia under the Creative Commons 3.0 licence. The declaration of the creator (Gfk GeoMarketing) on free usage is archived in the Wikimedia OTRS System.


fund has been sending out updates every quarter. Not long ago the number of recipient hosts fell to two due to reorganization (the National Institute for Quality- and Organizational Development in Healthcare and Medicines28; the National Public Health and Medical Officer Service29).

6.1. The Hungarian legal environment of the Itemized Medical Database

According to Section 5 of the Hungarian Data Protection Act,30 personal data can be processed when the data subject gives consent or when processing is provided for by law. The paragraphs of Article 7 b), e), f) of the 95/46/EC Data Protection Directive – which simply allow the processing of personal data in specified circumstances for the purposes of the legitimate interests of the data controller or a third party, the carrying out of a public task, or entering into a contractual relationship – have not been implemented since 1992. Apart from the case when personal data processing is carried out in the vital interest of the data subject, there exist only two possible ways of performing data processing: obtaining consent from data subjects or passing a law on obligatory data processing.

As a consequence, medical data processing is controlled by a secondary data protection law known as the Health Data Protection Act (HDPA).31 According to this law, processing medical data is always obligatory. There is only one case where medical data processing is done upon consent: prospective medical database research. Here, data subjects have to consent to the collection of personal medical data related to them.32

Section 20, Paragraph 4 of the HDPA authorizes the health minister to issue a decree on transferring personally non-identifiable medical data to health government bodies from the National Health Insurance Fund or from healthcare institutions.

Based on this authorization, the minister issued his Decree No. 76 of 2004 in order to establish the IMD from the existing data and its quarterly updates. The National Health Insurance Fund has to produce and transfer the pseudonymized data.

The SNIIR-AM (Systeme National d'Information Interregimes de l'Assurance Maladie) is a similar medical database in France that stores health insurance accounting data on individuals. The main purpose of the database is to improve the quality of healthcare and the management of health insurance. It stores itemized personal data for three years, after which the data are archived for another ten years.33 The database stores the month and the year of birth instead of the full date of birth. The national health identifier is replaced twice by a pseudonym with the help of disinterested third parties. Processing pseudonymized data for research purposes is subject to the approval of the Institute of Health Data or the data protection authority.

The Secondary Use Service (SUS) stores health insurance accounting information on patients in the United Kingdom. It stores itemized accounting information for an unspecified time for the purposes of improving the quality of healthcare. Researchers can obtain data from this database either in effectively anonymized form or with the approval of the Health Research Authority.

In both of these foreign examples there is public control over the use of the database. According to the privacy policies of these public authorities, a waiver from the consent requirement is given only if the data are truly anonymous or obtaining consent is impossible or requires disproportionate effort.

6.2. The Hungarian legal environment versus the current EU data protection regulation

Article 8, Paragraph 3 of the EU Data Protection Directive34 says that special categories of personal data such as health data can be processed 'for the purposes of preventive medicine,


medical diagnosis, the provision of care or treatment or the management of health-care services'. Quality and financial control are generally treated as management tasks. Thus, collecting personal health data for management purposes is permitted by the EU Data Protection Directive. The unlimited retention time, though, may violate Article 6 (e), which requires that the data be deleted when they are no longer needed.

Needless to say, the IMD database is used for research purposes as well, which is compatible with the original purposes (management) based on Article 6 (b). When data are processed for research purposes, a member state should provide appropriate safeguards. Section 21 of the Hungarian HDPA says that when medical data are processed for research purposes, all direct personal identifiers have to be removed from the databases. In the light of the current paper this cannot be appropriate, since the remaining demographic data can still identify individuals with high probability. Article 6 (e) of the EU Data Protection Directive similarly requires that if datasets are stored for a longer period of time for research purposes, member states should implement safeguards to prevent the unlawful identification of data subjects.

Medical ethics requires that research subjects give informed consent to be enrolled in a research project. But there are exceptions, such as when obtaining consent requires disproportionate effort or is impossible; then both the ethical norms and the EU Data Protection Directive allow processing data without consent. Moreover, in preamble (34) the directive authorizes member states to pass laws on obligatory data processing for research purposes where important public interests so justify. The Hungarian approach, where processing personal health data retrospectively is always obligatory by law and is done without consent, is far from the EU principles.

International medical research ethics requires the involvement of an ethics committee in deciding whether a planned study on personal health data can be performed, whether it meets the ethical norms, and whether it will compromise the privacy of the research subjects. In addition, the operation of these ethics committees should be transparent and their positive decisions should be publicized. Hungary does not allow ethics committees to intervene when the data in the national medical databases are processed, saying that the datasets are anonymous. This practice substantially violates accepted international ethical norms.

6.3. The Hungarian legal environment versus the planned new EU data protection regulation

The European Parliament is working on a new data protection regulation that is intended to replace the current 95/46/EC Data Protection Directive. A draft of the so-called General Data Protection Regulation (GDPR) was published by the European Commission on 25 January 2012.35 According to Article 81 paragraph 1 (c) of the GDPR, processing medical data is allowed for the purposes of 'ensuring the quality and cost-effectiveness of the procedures'. Paragraph 2 says that 'processing of personal data concerning health which is necessary for historical, statistical or scientific research purposes, such as patient registries set up for improving diagnoses and differentiating between similar types of diseases and preparing studies for therapies' is also allowed if appropriate safeguards are provided. Data processing for the purposes of medical research can be made obligatory by law 'when substantial public interest justifies so' (see Recital 42). In brief, nothing has changed.

Up to now, more than 4000 amendments have been submitted to the GDPR. One remarkable amendment package was submitted by the Committee on Civil Liberties, Justice and Home Affairs of the European Parliament on 17 December 2012.36 The committee recommended an amendment to Recital 4237 so that medical database research could not be performed by the force of law. Furthermore, they advised an amendment to


Article 81, paragraph 2, saying that 'processing of personal data concerning health which is necessary for scientific research purposes shall be permitted only with the consent of the data subject'. If this amendment package is accepted, it will radically change the current approach to medical data processing for research purposes. At the end of 2009, the author and P. Konyves-Toth, on behalf of the Association for Fair Data Processing, sent an opinion to the European Commission in response to the call for public consultation on the legal framework for the fundamental right to the protection of personal data.38 In Section 5 of that document, the author asked for a declaration that in the European Union, processing of personal medical data for research purposes should be based on voluntary consent, except in those rare cases when obtaining consent requires disproportionate effort or is impossible.

6.4. Open privacy problems concerning the Itemized Medical Database

The IMD database was declared anonymous by the Decree, but in fact it is not anonymous. Compared with the HIPAA Privacy Rule, the database contains institution/department codes, medical diary serial numbers, treating and referring physicians' identification numbers, dates of birth, death, admission and discharge, and the full zip code of the residential address. Since Hungary has a public registry of all licensed physicians39 with their names, specialities, and workplace(s), practically each care event is paired with the name and workplace (and phone number) of the treating physician. By applying pseudo-TAJ identifiers, all medical care events associated with a given Hungarian citizen since 1998 can be combined. Pro familia interventions and prescriptions are marked; hence family members of doctors (300–400,000 people) can be directly identified with the help of the public registry of physicians, since the identity of the prescribing physician and the age of the patient (i.e. the age of the close family member, such as a spouse, child, or parent) are given.

The recipient hosts have not only the IMD but several other personal medical file systems, such as the Vaccination Registry, the National Cancer Registry, the Newborn Developmental Disorder Registry and the Screening Registry. These contain the names, addresses, diagnoses, and/or TAJs of the patients. The recipient hosts obviously have the possibility of linking these databases together to reveal the complete medical history of a given patient. They can trace almost anybody.

The author challenged Decree 76 of 2004 before the Constitutional Court, but the court did not understand the concept of anonymity. It arrived at the decision in 2007 (case no. 937/B/2006)40 that since the database did not contain natural personal identification data or a TAJ ID, it was anonymous. The court declared that since combining databases without legal authorization was unlawful, it must not occur; hence it did not need to be taken into account.

The IMD – being an anonymous dataset – is not subject to ethical supervision. Anyone having a connection to one of the recipient hosts can get access to some of the data. No records are made available of industrial, commercial or research partners who were granted access to the database. Although the nomination of a Data Protection Officer has been a legal obligation since 1 January 1998, the National Institute for Quality- and Organizational Development in Healthcare and Medicines nominated an officer only in 2013, after the author once again sent a complaint to the President of the National Data Protection and Freedom of Information Authority.

7. Estimating the probability of re-identification of the IMD

Several teams mentioned in Section 4 tried to estimate the re-identification risk of some anonymized medical research databases. None of them was able to use exact population


demographic data. Instead, the authors had to make do with partial databases, such as a market research database, a voters' database or the census database. None of these databases contained the exact date of birth or details of all persons concerned. All of these papers were unsatisfactory in some way. The author had the opportunity to use a portion taken from the official Hungarian national population registry. He contacted the Central Office for Administrative and Electronic Public Services, which is the controller of the registry, and requested distribution data. It was sent to the author's department in January 2013.

7.1. The national population registry data

The Population Registry Research Dataset (PRRD) is 270 MB of text data. The dataset is organized as follows (see Figure 3).

Each line contains four data items separated by semicolons: zip code, date of birth (in year-month-day order), gender (N is female, F is male), and the number of people living in the given zip code area having the given date of birth and gender. An end-of-line symbol terminates each record. The PRRD contains data on 10,004,090 living individuals who have a registered residential address in Hungary and were born on or before 31 December 2011.
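A record layout like the one described can be parsed in a few lines. The sketch below assumes semicolon-separated fields and a hyphenated year-month-day date; the actual separators inside the date field of the PRRD may differ:

```python
from collections import Counter

def parse_prrd(lines):
    """Build a Counter mapping (zip, date_of_birth, sex) -> head count
    from PRRD-style 'zip;yyyy-mm-dd;sex;count' lines."""
    bins = Counter()
    for line in lines:
        zip_code, dob, sex, count = line.strip().split(";")
        bins[(zip_code, dob, sex)] = int(count)
    return bins

# Hypothetical sample rows mirroring Figure 3 (6188 is not a real Hungarian zip code)
sample = ["6188;1975-03-14;F;3", "6188;1975-03-14;N;2"]
bins = parse_prrd(sample)
print(sum(bins.values()))  # 5 people in total
```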

The distribution of the population by year of birth is shown in Figure 4. The weights are normalized, i.e. the sum of all weights across years equals 100%. The graph has two peaks: those born in 1954 have a 1.7% weight and those born in 1975 have a 1.9% weight. The ratio of those born before 1953 decreases almost linearly.

Figure 3. The structure of the population registry research dataset. The original values were deliberately altered by the author. There is no such zip code as 6188 in Hungary.

Figure 4. The weight of the subpopulation born in the given year.


Figure 5 shows the distribution of zip code sizes. There are 3110 different zip codes. Of these, 73 zip code areas have a population of over 20,000, 475 have a population of over 5000, and 2635 have a population below 5000 (see Table 2). 35.24% of the Hungarian population live in small settlements with fewer than 5000 inhabitants, while 64.76% live in urban areas where the population is over 5000.

7.2. Computing the re-identification risk using PRRD

The re-identification risk estimation is based on the concept of k-twins.

k-twins
For any given integer k, if the PRRD contains exactly k persons having the same quasi-identifiers (e.g. zip code, date of birth, gender), then this set of people is called k-twins.

In other words, a set of k-twins can be viewed as a bin that contains exactly k individuals (Golle 2006). A detailed explanation was given in Section 4, along with mathematical formula (2). The number of k-distinct people can be obtained by summing all the people in the bins containing 1 ≤ i ≤ k people:

number of (k-distinct) = ∑_{1≤i≤k} i · number of (i-twins)    (3)

Figure 5. The settlement structure of Hungary.

Table 2. Settlement structure in numbers.

Population                    Number of zips   All inhabitants   Percentage (%)   Percentage (%)
over 20,000                   73               2,594,802         25.94            64.76
below 20,000 and over 5000    402              3,883,348         38.82
below 5000 and over 1000      1296             2,800,312         27.99            35.24
under 1000                    1339             725,628           7.25
Total                         3110             10,004,090        100.00


The author followed the approach presented by Benitez and Malin (2010). The re-identification risk for k-distinct people is defined as follows:

risk(k-distinct) = identifiable people / all people = (∑_{1≤i≤k} i · number of (i-twins)) / all people    (4)

This formula assumes that k-distinct people can always be identified with the help of some external information. Such information could be a favored institution where the target individual was treated, or a date when an individual was treated. However, this pessimistic approach may not always be realistic. When a specific individual among k people can be identified only with a certain probability, the above formula has to be modified.

Now let us assume that an adversary would like to identify a specific person whose demographic characteristics (quasi-identifiers) are known. The attacker could select those records from the database that match the given quasi-identifiers. If several records match, he can randomly select one. The probability of correct identification is at most 1/i if the selected records are i-twins. Applying such an algorithm, the re-identification risk is lower and can be computed as:

risk(k-distinct) = (∑_{1≤i≤k} i · number of (i-twins) · (1/i)) / all people = (∑_{1≤i≤k} number of (i-twins)) / all people    (4*)
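Formulas (4) and (4*) can be checked against the k-twin counts that Table 3 reports for the PRRD. The formulas are the author's; the code itself is an illustrative sketch with hypothetical helper names:

```python
# k -> number of k-twin bins, taken from Table 3 (four-digit zip, exact DOB, sex)
twins = {1: 7_845_850, 2: 779_027, 3: 136_968, 4: 31_905, 5: 8_353,
         6: 2_316, 7: 629, 8: 135, 9: 43, 10: 12, 11: 1}

population = sum(i * n for i, n in twins.items())  # 10,004,090 people in total

def risk_pessimistic(k):
    """Formula (4): every k-distinct person counts as identifiable."""
    return sum(i * twins.get(i, 0) for i in range(1, k + 1)) / population

def risk_realistic(k):
    """Formula (4*): an i-twin is guessed correctly with probability 1/i."""
    return sum(twins.get(i, 0) for i in range(1, k + 1)) / population

for k in (1, 2, 5):
    print(k, f"{100 * risk_pessimistic(k):.2f}%", f"{100 * risk_realistic(k):.2f}%")
```

The pessimistic risks come out at about 78.43%, 94.00% and 99.80%, and the realistic ones at about 78.43%, 86.21% and 87.99%, matching the figures quoted in the text up to rounding.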

From the PRRD, the number of k-twins can be readily calculated. The results are listed in Table 3. They show that the total Hungarian population is 11-distinct. The estimated identification risks using formula (4) are 78.43%, 94.0%, and 99.8%; using formula (4*) they are 78.43%, 86.21%, and 87.98% for 1-distinct, 2-distinct and 5-distinct people, respectively.

It is worth returning to the rough calculation made in Section 2, where Dr. Golle's formula (2) was applied. With that method of estimation, a figure of 98.44% was obtained using the pessimistic formula (4), and a figure of 90.6% using the more realistic formula (4*), as the re-identification risk for 2-distinct people in a zip code area having 8–10,000 inhabitants.

Figure 6 shows how the re-identification risk varies with age using formula (4). The three graphs show the potential risk values for 1-distinct, 2-distinct and 5-distinct persons. Needless to say, all three graphs start from 100% and then slowly decrease. In the last 20 years there were some small increases in the risk values due to fewer births. The minimum value for 1-distinct was 77.57% in 1983, for 2-distinct it was 93.41% in 1982

Table 3. The number of k-twins among the Hungarian population using a four-digit zip code, exact date of birth, and sex.

The number of k-twins in the PRRD

k                  1          2        3        4       5      6
number of k-twins  7,845,850  779,027  136,968  31,905  8,353  2,316

k                  7    8    9   10  11  12
number of k-twins  629  135  43  12  1   0


and for 5-distinct it was 99.75% in 1981. This means that if we examine just the adult subpopulation (those born before 1981), the risk of re-identification is slightly lower than when we examine the whole population. The risk for 5-distinct persons is always greater than 99.75%.

The above calculation can be repeated using the more realistic formula (4*). Figure 7 shows the new plots. The minimum risk for 1-distinct was 77.57% in 1983, for 2-distinct it was 85.49% in 1982 and for 5-distinct it was 87.42% in 1982.

7.3. Reducing the re-identification risk by data generalization

The PRRD dataset allows one to perform experiments with different types of generalizations. In the following, several examples and their results are presented: the three-digit zip code generalization, the two-digit zip code generalization, the year-month generalization of

Figure 6. The ratio of n-distinct persons among the Hungarian population for persons over a given age using formula (4).

Figure 7. The ratio of n-distinct persons among the Hungarian population for persons over a given age using formula (4*).


the date of birth, the year of birth generalization, the three-digit zip code and year-month generalization, and the three-digit zip code and year of birth generalization.

According to a study in Canada (El Emam et al. 2009), a database should not be shared with researchers when the re-identification risk is above 33%; an additional privacy measure should be applied when the re-identification risk is above 20% but below 33%. If the risk is below 20%, then the database can be released.

Table 4 shows the results of the three-digit generalization. This means that the last digit of the zip code was deleted and only the remaining first three digits were put into the risk-reduced database. The computed re-identification risks obtained were 57.86%, 82.09%, and 98.71% with the pessimistic formula (4) and 57.86%, 69.98% and 74.88% with the realistic formula (4*). The three-digit generalization did not substantially reduce the re-identification risk.

Table 5 shows the results of the two-digit generalization. The computed risks obtained were 14.81%, 33.57% and 71.29% using the pessimistic formula (4) and 14.81%, 24.19% and 34.54% with the realistic formula (4*).

Table 6 shows the results of the year and month of birth generalization. The computed risks were 15.0%, 27.79% and 50.96% with the pessimistic formula (4) and 15.0%, 21.34% and 27.65% using the realistic formula (4*).

Table 7 shows the results of the year of birth generalization. The computed risks were 0.59%, 1.63% and 6.24% with the pessimistic formula (4) and 0.59%, 1.10% and 2.29% with the realistic formula (4*).

When the three-digit zip code and the year and month of birth generalization was applied, the computed risks were 1.85%, 5.02% and 18.18% with the pessimistic formula (4) and 1.85%, 3.44% and 6.83% with the realistic formula (4*). In the case of the three-digit zip code and the year of birth generalization, the computed risks were 0.037%, 0.081% and 0.27% with the pessimistic formula (4) and 0.037%, 0.059% and 0.11% with the realistic formula (4*).
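The generalizations above amount to coarsening the bin keys and recounting. The sketch below assumes PRRD bins keyed as (zip, 'yyyy-mm-dd', sex); the key layout and helper names are illustrative, not the author's code:

```python
from collections import Counter

def generalize(bins, zip_digits=3, dob_parts=2):
    """Coarsen bins {(zip, 'yyyy-mm-dd', sex): count}: keep the first
    zip_digits of the zip code and the first dob_parts components of the
    date of birth (3 = full date, 2 = year-month, 1 = year only)."""
    out = Counter()
    for (zip_code, dob, sex), count in bins.items():
        key = (zip_code[:zip_digits], "-".join(dob.split("-")[:dob_parts]), sex)
        out[key] += count
    return out

def risk_realistic(bins, k):
    """Formula (4*) on a (possibly generalized) table: each bin of size
    at most k contributes one correctly guessed person."""
    population = sum(bins.values())
    return sum(1 for c in bins.values() if c <= k) / population

bins = Counter({("6188", "1975-03-14", "F"): 1, ("6187", "1975-03-20", "F"): 2})
coarse = generalize(bins, zip_digits=2, dob_parts=1)
# both rows collapse into the key ('61', '1975', 'F') with count 3
```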

Table 4. The number of k-twins among the Hungarian population using the three-digit zip code, exact date of birth, and sex.

The number of k-twins in the PRRD

k                  1          2          3        4        5       6
number of k-twins  5,788,287  1,212,048  339,838  112,431  38,703  13,411

k                  7     8     9    ≥10
number of k-twins  4407  1451  437  202

Table 5. The number of k-twins among the Hungarian population using the two-digit zip code, exact date of birth, and sex.

The number of k-twins in the PRRD

k                  1          2        3        4        5        6
number of k-twins  1,482,052  937,932  546,408  311,166  177,947  103,848

k                  7       8       9       ≥10
number of k-twins  61,830  38,253  25,033  89,277


When the two-digit zip code and the year and month of birth generalization was applied, the computed risks were 0.056%, 0.12% and 0.39% with the pessimistic formula (4) and 0.056%, 0.087% and 0.156% with the realistic formula (4*). In the case of the two-digit zip code and the year of birth generalization, the computed risks were 0.0037%, 0.0083% and 0.022% with the pessimistic formula (4) and 0.0037%, 0.0060% and 0.0096% with the realistic formula (4*).

When the HIPAA Privacy Rule was applied, only one person (out of the 10 million) was uniquely identifiable; seven people were identifiable in the 5-distinct pessimistic case and three people in the 5-distinct realistic case.

7.4. Can Dr. Golle’s model be confirmed by the PRRD dataset?

The PRRD allows us to verify Golle's hypothesis and formula (1) on the expected number of i-twins within a population. Figure 8 shows how the number of births varies by the

Table 6. The number of k-twins among the Hungarian population using the four-digit zip code, year and month of birth, and sex.

The number of k-twins in the PRRD

k                  1          2        3        4        5        6
number of k-twins  1,500,071  634,465  319,279  187,768  124,079  87,232

k                  7       8       9       ≥10
number of k-twins  64,881  49,372  37,533  183,482

Table 7. The number of k-twins among the Hungarian population using the four-digit zip code, year of birth, and sex.

The number of k-twins in the PRRD

k                  1       2       3       4       5       6
number of k-twins  58,591  51,970  45,620  39,608  33,372  28,653

k                  7       8       9       ≥10
number of k-twins  24,339  20,979  18,139  213,335

Figure 8. The weight of the subpopulation born in a certain month by decades.


month. The number of persons born in a given month is summed by decades and normalized, i.e. the sum of the percentages for the 12 months is 100%. If births were uniformly distributed, then for each month the expected value would be 1/12 = 8.33%. A horizontal line highlights this expected value. The heights of the bars denote the actual values. The actual birth numbers always fall between 8% and 9%, which means they are practically uniformly distributed.

The author selected seven zip codes and listed the number of k-twins in these zip code areas in Table 8. In this computation, the four-digit zip code, exact date of birth and gender quasi-identifiers were used.

Table 9 lists the same zip codes, but the number of k-twins was computed by applying Golle's formula (2). It can readily be seen that childbirth is a probabilistic process. Actual values cannot be expected to equal the predicted values, but Tables 8 and 9 reveal a good correlation. In the case of the other zip codes, a similar pattern was observed.

The author modified formula (2) slightly to get more precise results. The given population was split into years using the weights shown in Figure 4. Half of the yearly population was treated as male and the other half as female. The number of i-twins was then computed by summing male and female i-twins for all years between 1907 and 2011. The weights before 1907 were practically zero; hence these years were not included in the calculation.
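Formula (2) itself is not reproduced in this section, but the standard binomial model behind Golle-style estimates can be sketched as follows. The sketch assumes n persons are assigned independently and uniformly to m equally likely quasi-identifier bins; the function name and the bin counts are illustrative assumptions, not the author's exact computation:

```python
from math import comb

def expected_k_twins(n: int, m: int, k: int) -> float:
    """Expected number of individuals who share their quasi-identifier
    bin with exactly k - 1 others, under a uniform binomial model:
    n people, m equally likely bins, p = 1/m."""
    p = 1.0 / m
    return n * comb(n - 1, k - 1) * p ** (k - 1) * (1.0 - p) ** (n - k)

# Sanity check: every person is a k-twin for exactly one value of k,
# so the expectations over all k sum to the population size.
total = sum(expected_k_twins(100, 10, k) for k in range(1, 101))
print(round(total))  # 100
```

The author's refinement splits the population by birth year and sex first and sums the per-cell expectations, which this uniform sketch does not capture.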

Table 8. Examples of the distribution of k-twins.

Zip    Population   1-twins    2-twins   3-twins   4-twins   5-twins   6-twins
6500     32,660      18,705      4997      1072       155        25
4060     17,795      12,945      2065       213        19         1
6237      8829        7473        613        38         4
6635      4699        4331        181         2
8248      2969        2792         84         3
8096      1306        1272         17
7381       817         807          5

Table 9. Distribution of k-twins computed using Golle's formula.

Zip    Population   1-twins    2-twins   3-twins   4-twins   5-twins   6-twins
6500     32,660    19,039.98   5029.45    957.79    143.76     17.95     1.93
4060     17,795    13,240.45   1936.18    202.29     16.57      1.12
6237      8829      7629.82     554.96     28.51      1.14      0.04
6635      4699      4347.56     166.39      4.43      0.09
8248      2969      2828.17      67.25      1.09
8096      1306      1271.14      12.32      0.08
7381       817       806.98       4.49

8. Conclusions

In 2004, the Hungarian government created the Itemized Medical Database (IMD), which holds accumulated pseudonymized medical care data for the whole population and goes back to the year 1998. The data are retained for an unspecified time by law. The National Health Insurance Fund is responsible for sending quarterly updates to two recipient hosts. The fund simply replaces the national TAJ identifiers with pseudo-TAJs and does not change the other data components of its accounting dataset. This means that the pseudonymized dataset still contains the date of birth, the gender, the zip code (if not, the zip code related to a pseudo-TAJ is sent by the fund in a separate file), and many other quasi-identifiers, such as institution/department codes, the treating or referring physician's identifier, operation/medical diary numbers, and the exact dates of admission, treatment, discharge and death. Patients do not have the right to object to the transfer of data to this database or to specific uses of the dataset, which by law is considered anonymous. It follows that there are no records of who actually had access to the database and who obtained portions of it.

The author obtained distribution data from the national population registry: a snapshot of the zip code, date of birth and gender distribution of the whole population, taken on 31 December 2011. The PRRD database contained data about 10,004,090 inhabitants. The Itemized Medical Database is a superset of the PRRD because deceased persons were not deleted from the IMD. Thus, if 1% of the population dies every year, then over a period of 15 years the IMD could have a surplus of 15%; today, it may contain data about 11–12 million people. It is hard not to be included in this database because there are obligatory medical screenings and examinations in Hungary, and each one generates a record in the IMD.

The author processed the statistical data in the PRRD. The results can help to estimate the re-identification risk of the current database or to develop more secure anonymization techniques specifically designed for the Hungarian zip code system and population. The direct 1-distinct re-identification risk was 78%, which is quite high, but it depends on the particular location; in the provinces it was well above 90%. It follows that the IMD cannot be viewed as a fairly anonymized database. The author computed re-identification risk values after making several generalizations; the results are summarized in Table 10.
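As a rough sketch of how such a risk figure is derived, the 1-distinct risk is simply the share of the population that is unique under the chosen quasi-identifiers. The sketch below reads Table 6's entries as counts of k-anonymous groups; for k = 1 the group count equals the individual count, so the unique fraction is unaffected by this reading:

```python
# k-twin counts from Table 6 (four-digit zip code, year and month of
# birth, sex); the last entry aggregates all groups of size >= 10.
k_twins = {1: 1_500_071, 2: 634_465, 3: 319_279, 4: 187_768, 5: 124_079,
           6: 87_232, 7: 64_881, 8: 49_372, 9: 37_533, 10: 183_482}

population = 10_004_090  # PRRD snapshot, 31 December 2011

# 1-distinct risk: the share of the population that is unique under
# these quasi-identifiers (for k = 1, groups and individuals coincide).
risk_1_distinct = 100.0 * k_twins[1] / population
print(f"1-distinct risk: {risk_1_distinct:.1f}%")
```

Because Table 6 uses only the year and month of birth, this yields roughly 15%; the 78% figure in the text corresponds to the exact date of birth.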

The year of birth generalizations provided the best results. When combined with the two-digit zip code generalization, the risk was 0.022% in the 5-distinct pessimistic case. This still means that about 2200 persons out of 10 million are identifiable. The HIPAA Privacy Rule produced extremely good results: in the pessimistic case only seven people out of the 10 million were identifiable (0.00007%).

The PRRD provided a good opportunity to test Dr. P. Golle's formula, and the experimental data agree quite well with the values obtained using formula (2).

The author believes that fair anonymization exists, but only statisticians can judge objectively whether a database is truly anonymous. A small re-identification risk, below 0.1%, is acceptable. According to the case reports and the author's results, the HIPAA Privacy Rule meets this requirement.

Table 10. Re-identification risk after applying different generalizations.

Generalization                              1-distinct (%)   5-distinct (pessimistic) (%)   5-distinct (realistic) (%)
Three-digit zip code                           57.86               98.71                         74.88
Two-digit zip code                             14.81               71.29                         34.54
Year and month of birth                        15.0                50.96                         27.65
Year of birth                                   0.59                6.24                          2.29
Three-digit zip, year and month of birth        1.85               18.18                          6.83
Three-digit zip, year of birth                  0.037               0.27                          0.11
Two-digit zip, year and month of birth          0.056               0.39                          0.15
Two-digit zip, year of birth                    0.0037              0.022                         0.0096
HIPAA Privacy Rule                              1 person            7 people                      3 people


Acknowledgement

The author would like to thank the Central Office for Administrative and Electronic Public Services (in Hungarian: Közigazgatási és Elektronikus Közszolgáltatások Központi Hivatala, KEKKH) for the research dataset taken from the national population registry.

Funding

This work was partially supported by the European Union and the European Social Fund through project FuturICT.hu (grant no. TAMOP-4.2.2.C-11/1/KONV-2012-0013).

Notes

1. In Paul Ohm, Broken Promises of Privacy: Responding to the Surprising Failure of Anonymization, footnote 223.
2. Directive 95/46/EC of the European Parliament and of the Council of 24 October 1995 on the protection of individuals with regard to the processing of personal data and on the free movement of such data, Article 6 paragraph 1 (a).
3. The law was intended to establish controlled information flow between covered entities such as health service providers, health insurance companies and state supervisory authorities all over the United States, as well as protecting patients' privacy. It has come into force in a step-by-step fashion since 1996. The latest modifications to the HIPAA Privacy, Security, Enforcement and Breach Notification Rules came into force in September 2013.
4. See Evans (2011, 72).
5. See http://www.hhs.gov/ocr/privacy/hipaa/understanding/coveredentities/De-identification/guidance.html (retrieved 12 October, 2013).
6. See Ohm (2010, 1721).
7. Internet Movie Database, http://www.imdb.com/
8. See Information Commissioner's Office (2012, 17).
9. The latest, amended version of the declaration (dated 2008) can be found on the web page of the WMA: http://www.wma.net/en/30publications/10policies/b3/ (retrieved 12 October, 2013).
10. On 19 August 1947, the court delivered the verdict in the 'Doctors' Trial' in Nuremberg against doctors involved in human experiments in concentration camps. The court applied these basic ethical principles in the decision, accepting Dr. Leo Alexander's six points and adding four others. See: http://www.hhs.gov/ohrp/archive/nurcode.html (retrieved 12 October, 2013).
11. Convention for the Protection of Human Rights and Dignity of the Human Being with regard to the Application of Biology and Medicine: Convention on Human Rights and Biomedicine, No. CETS-164, Oviedo, 4 April 1997.
12. There may be several reasons for it. For example, the data subject has already died, getting consent from many data subjects requires disproportionate effort, or the data subjects would probably deny their consent.
13. See Caldicott (2013, 79).
14. UK High Court decision EWHC 1430 (20 April, 2011), http://www.bailii.org/ew/cases/EWHC/Admin/2011/1430.html (retrieved 12 October, 2013).
15. See the decision in the HuDOC database: http://www.echr.coe.int/ECHR/EN/Header/Case-Law/Decisions+and+judgments/HUDOC+database/
16. http://www.census.gov/
17. The place could be a town, city or municipality. Several five-digit zip codes may be associated with a single geographical place.
18. It contains aggregated data taken from all places within that county.
19. Dr. Philip Golle (Golle 2006) wanted to reproduce this figure by making use of the US 2000 Census database, but he found that 63% of the US population were uniquely identifiable.
20. http://mathworld.wolfram.com/BinomialCoefficient.html (retrieved 12 October, 2013).
21. By definition, a quasi-identifier is not a unique identifier, but when applied in combination with other quasi-identifiers it may uniquely identify a person.
22. They found five quasi-identifiers: sex, length of stay in days, the quarter of admission, region, and age in weeks for newborn babies.
23. The referendum of 8 April 2013 in Nagykoros was declared invalid due to a poor turnout, URL: http://www.pestmegyei-hirhatar.hu/hir/ervenytelen-lett-a-nagykorosi-nepszavazas (retrieved 12 October, 2013).


24. http://adatgyujtes.gyemszi.hu/TEA/ (retrieved 12 October, 2013).
25. http://www.oep.hu
26. In 2009, the Constitutional Court issued decision No. 29/2009 on 21 March, ruling that the National Health Insurance Fund must not collect personal data from unsubsidized health care events (including purchasing unsubsidized prescription-only medicines). The author did much to support the interests of the case.
27. If a TAJ is d1d2d3d4d5d6d7d8d9, then d9 = [3*(d1 + d3 + d5 + d7) + 7*(d2 + d4 + d6 + d8)] mod 10; see Act XX of 1996 on Personal Identification Methods and Identification Codes.
28. http://www.gyemszi.hu
29. http://www.antsz.hu
30. Its official name is Act CXII of 2011 on Informational Self-Determination and Freedom of Information, which came into force on 1 January 2012. An English translation can be obtained from the homepage of the National Authority for Data Protection and Freedom of Information. As it happens, the text is not completely faithful. See: http://www.naih.hu/files/ActCXIIof2011_mod_lekt_2012_12_05.pdf (retrieved 12 October, 2013).
31. Its official name is Act XLVII of 1997 on Processing and Protection of Health Data and Personal Data Related to Them.
32. Retrospective medical research is supported by force of law, without the possibility of legal remedy. Although the Data Protection Act contains the right to object in the case of scientific research, the Health Data Protection Act does not. As regards the 'lex specialis derogat legi generali' principle, the issue of the right to object is currently under debate.
33. Arrêté du 11 juillet 2012 relatif à la mise en œuvre du système national d'information inter-régimes de l'assurance maladie, URL: http://www.legifrance.gouv.fr/affichTexte.do?cidTexte=JORFTEXT000026221180&dateTexte=&categorieLien=id (retrieved 12 October, 2013).
34. Directive 95/46/EC of the European Parliament and of the Council of 24 October 1995 on the protection of individuals with regard to the processing of personal data and on the free movement of such data.
35. Proposal for a Regulation of the European Parliament and of the Council on the protection of individuals with regard to the processing of personal data and on the free movement of such data (General Data Protection Regulation), http://eur-lex.europa.eu/smartapi/cgi/sga_doc?smartapi!celexplus!prod!CELEXnumdoc&lg=en&numdoc=52012PC0011 (retrieved 12 October, 2013).
36. http://www.europarl.europa.eu/meetdocs/2009_2014/documents/libe/pr/922/922387/922387en.pdf (retrieved 12 October, 2013).
37. Amendment 27 on page 24.
38. http://ec.europa.eu/justice/news/consulting_public/0003/contributions/organisations_not_registered/association_for_fair_data_processing_en.pdf (retrieved 12 October, 2013).
39. The service can be accessed using http://kereso.eekh.hu/ (retrieved 12 October, 2013).
40. The Hungarian decision can be found on the homepage of the Constitutional Court: http://public.mkab.hu/dev/dontesek.nsf/0/059A5C0C4D459EF7C1257ADA00529A2E?OpenDocument (retrieved 12 October, 2013).
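The TAJ check-digit rule cited in note 27 (Act XX of 1996) can be sketched directly in code; the function names are illustrative, and the sample number is made up:

```python
def taj_check_digit(taj8: str) -> int:
    """Check digit d9 for the first eight TAJ digits:
    d9 = [3*(d1 + d3 + d5 + d7) + 7*(d2 + d4 + d6 + d8)] mod 10."""
    d = [int(c) for c in taj8]
    return (3 * (d[0] + d[2] + d[4] + d[6]) + 7 * (d[1] + d[3] + d[5] + d[7])) % 10

def is_valid_taj(taj9: str) -> bool:
    """True if a nine-digit TAJ ends with the correct check digit."""
    return (len(taj9) == 9 and taj9.isdigit()
            and int(taj9[8]) == taj_check_digit(taj9[:8]))

# "12345678" yields check digit (3*16 + 7*20) mod 10 = 188 mod 10 = 8,
# so "123456788" passes the check (an illustrative, not a real, TAJ).
print(is_valid_taj("123456788"))  # True
```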

References

Benitez, K., and B. Malin. 2010. "Evaluating Re-Identification Risks with Respect to the HIPAA Privacy Rule." Journal of the American Medical Informatics Association 17 (2): 169–177. doi:10.1136/jamia.2009.000026.

Caldicott, F. 2013. "Information: To Share or Not to Share? The Information Governance Review." Independent review of how information about patients is shared across the health and care system. United Kingdom, Department of Health, published 26 April 2013. Accessed 12 October 2013. https://www.gov.uk/government/publications/the-information-governance-review

Cutrona, S. L., et al. 2012. "Design for Validation of Acute Myocardial Infarction Cases in Mini-Sentinel." Pharmacoepidemiology and Drug Safety, Supplement: The U.S. Food and Drug Administration's Mini-Sentinel Program 21 (S1): 274–281. doi:10.1002/pds.2314.


El Emam, K., F. K. Dankar, R. Vaillancourt, T. Roffey, and M. Lysyk. 2009. "Evaluating the Risk of Re-Identification of Patients from Hospital Prescription Records." Canadian Journal of Hospital Pharmacy 62 (4): 307–319. PMCID: PMC2826964.

Evans, B. J. 2011. "Much Ado About Data Ownership." Harvard Journal of Law and Technology 15 (1): 69–130. University of Houston Law Center No. 1857986. Accessed 12 October 2013. http://ssrn.com/abstract=1857986

Golle, P. 2006. "Revisiting the Uniqueness of Simple Demographics in the US Population." In Proceedings of the 5th ACM Workshop on Privacy in Electronic Society, 77–80. ACM.

Information Commissioner's Office (ICO). 2012. Anonymisation: Managing Data Protection Risk, Code of Practice. United Kingdom, November 2012. Accessed 12 October 2013. http://www.ico.org.uk/for_organisations/data_protection/topic_guides/~/media/documents/library/Data_Protection/Practical_application/anonymisation_code.ashx

Kwok, P., M. Davern, E. Hair, and D. Lafky. 2011. Harder Than You Think: A Case Study of Re-Identification Risk of HIPAA-Compliant Records. Chicago: NORC at the University of Chicago. Abstract #302255.

Machanavajjhala, A., D. Kifer, J. Gehrke, and M. Venkitasubramaniam. 2007. "l-Diversity: Privacy Beyond k-Anonymity." ACM Transactions on Knowledge Discovery from Data (TKDD) 1 (1): 3. doi:10.1145/1217299.1217302.

Ohm, P. 2010. "Broken Promises of Privacy: Responding to the Surprising Failure of Anonymization." UCLA Law Review 57: 1701–1777. U of Colorado Law Legal Studies Research Paper No. 9–12. Accessed 12 October 2013. http://ssrn.com/abstract=1450006

Office for Civil Rights. 2012. Guidance Regarding Methods for De-identification of Protected Health Information in Accordance with the Health Insurance Portability and Accountability Act (HIPAA) Privacy Rule. 26 November 2012. Accessed 12 October 2013. http://www.hhs.gov/ocr/privacy/hipaa/understanding/coveredentities/De-identification/hhs_deid_guidance.pdf

Samarati, P. 2001. "Protecting Respondents' Identities in Microdata Release." IEEE Transactions on Knowledge and Data Engineering 13 (6): 1010–1027. doi:10.1109/69.971193.

Sweeney, L. 2000. Simple Demographics Often Identify People Uniquely. Data Privacy Working Paper 3. Pittsburgh: Carnegie Mellon University. Accessed 12 October 2013. http://dataprivacylab.org/projects/identifiability/paper1.pdf
