8/25/15
HotSec 2013
Privacy Considerations of Genome Sequencing
E. Ayday, E. De Cristofaro, J.-P. Hubaux, G. Tsudik
With contributions from Z. Huang, M. Humbert, J.-L. Raisaro
Many thanks to geneticists J. Fellay, P. McLaren and A. Telenti
On Convergence…
“The last inch”
Digital medicine:
- Digital medical records
- Digital imaging
- Medical online social networks
- Genome sequencing
- Other 'omics data
- Wireless biosensors
…
Telecom, Computing
Modern IT
…0100110100011… …CGTTAATTCCGTA…
GATTACA (1997 Movie)
Medical Use of Genetics
• Genetic disease risk tests help early diagnosis of serious diseases
• Pharmacogenomics → personalized medicine
The SNP
• Human genome identical in most places for all people
• SNP (Single Nucleotide Polymorphism) → positions where some people have one nucleotide pair while others have another
Linkage disequilibrium (LD): correlation between the alleles of SNPs (usually located close to each other)
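One common way to put a number on LD is the squared correlation r² between the alleles at two SNPs. The sketch below is only an illustration: the slides do not say which LD measure is used, and the 0/1 allele coding and the function name `ld_r2` are assumptions.

```python
def ld_r2(haplotypes):
    """Squared correlation (r^2) between two biallelic SNPs.

    haplotypes: list of (allele_at_snp1, allele_at_snp2) pairs,
    with alleles coded 0 (major) or 1 (minor). This coding is an assumption.
    """
    n = len(haplotypes)
    p_a = sum(a for a, _ in haplotypes) / n          # frequency of allele 1 at SNP 1
    p_b = sum(b for _, b in haplotypes) / n          # frequency of allele 1 at SNP 2
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b                             # deviation from independence
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:                                   # one SNP does not vary in the sample
        return 0.0
    return d * d / denom


linked = [(0, 0), (1, 1), (0, 0), (1, 1)]            # alleles always co-occur
unlinked = [(0, 0), (0, 1), (1, 0), (1, 1)]          # alleles independent
print(ld_r2(linked), ld_r2(unlinked))
```

A value near 1 means that knowing one SNP nearly determines the other, which is what lets an attacker infer hidden SNPs from revealed ones.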
Key Concepts of Genomics
• Our genetic information is stored in the sequence of DNA, which is made of four nucleotides: A, T, G, and C
• The human genome is ~3 billion nucleotides long, and packaged into 23 pairs of chromosomes
• Genetic variants (including SNPs) are positions in the genome where people have different values
• Our collection of genetic variants is what makes each of us unique
• Modern techniques make it possible to determine the status of large numbers of SNPs very efficiently
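As a toy illustration of "positions where people have different values", the sketch below scans a few aligned sequences for variant positions and reports the minor allele frequency at one of them. The sequences, the function names, and the assumption of perfectly aligned, equal-length, biallelic data are all illustrative and not taken from the talk.

```python
from collections import Counter


def variant_positions(sequences):
    """Indices at which the aligned sequences do not all agree."""
    return [i for i, column in enumerate(zip(*sequences)) if len(set(column)) > 1]


def minor_allele_frequency(sequences, position):
    """Frequency of the less common nucleotide at a (biallelic) position."""
    counts = Counter(seq[position] for seq in sequences)
    if len(counts) < 2:          # everyone carries the same nucleotide
        return 0.0
    return min(counts.values()) / len(sequences)


people = ["GATTACA", "GATCACA", "GATTACA", "GATTACA"]   # hypothetical data
for pos in variant_positions(people):
    print(pos, minor_allele_frequency(people, pos))
```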
From the Sample to the Full Genome Sequence
[Figure: Samples → Sequencing machine (Illumina, Roche, Life Technologies, Oxford Nanopore, Pacific Biosciences, …) → Raw data (FASTQ) → SAM file (aligned reads) → Full genome. Also labelled: deep / ultra-deep sequencing.]
Uses of the full genome:
• Individual diagnosis, personalized medicine
• Statistics
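To make the "raw data" stage concrete, here is a minimal sketch of reading FASTQ records, which conventionally use four lines per read: an `@` header, the bases, a `+` separator, and a quality string. The Phred+33 quality encoding and the helper name `parse_fastq` are assumptions; real pipelines would use an established parser and handle multi-line records.

```python
def parse_fastq(text):
    """Parse simple four-line FASTQ records from a string."""
    lines = [ln.strip() for ln in text.strip().splitlines()]
    records = []
    for i in range(0, len(lines), 4):
        header, seq, plus, qual = lines[i:i + 4]      # ValueError if a record is truncated
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed FASTQ record")
        if len(seq) != len(qual):
            raise ValueError("sequence and quality lengths differ")
        records.append({
            "id": header[1:],
            "seq": seq,
            "phred": [ord(c) - 33 for c in qual],     # assumes Phred+33 encoding
        })
    return records


example = "@read1\nGATTACA\n+\nIIIIIII\n"             # hypothetical read
print(parse_fastq(example))
```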
Threat
• Leakage of genomic data
• Revelation of privacy-sensitive data about the patient
  – Predisposition to disease, ethnicity, paternity or filiation, etc.
  – Denial of access to health insurance, mortgage, education, and employment
• Cross-layer attacks
  – Using privacy-sensitive information belonging to a victim, retrieved from different sources (e.g., online social networks)
Misconceptions about Genome Privacy (1/6)
Misconception 1: Genome privacy is hopeless, because all of us leave biological cells (hair, skin, droplets of saliva, …) wherever we go
• Those cells can be collected and used for DNA sequencing
• Hence trying to protect genome privacy is a lost battle
• What is wrong with this reasoning?
• Collecting human biological samples and sequencing them is expensive, illegal, prone to mistakes, and non-scalable! (even if sequencing techniques keep improving)
Misconceptions about Genome Privacy (2/6)
Misconception 2: Genome privacy is irrelevant, because genetics is non-deterministic
• Genetic data as such is of little relevance because other aspects (especially the environment, nutrition, etc.) also play a major role in the evolution of health
• Hence genetic data is of little value for an attacker
• What is wrong with this reasoning?
• In some cases (e.g., genes BRCA1 and BRCA2 for breast cancer), the disease probabilities are highly related to genetic data
• Paternity can be checked
• Environmental data can be obtained from various sources, including online social networks
Misconceptions about Genome Privacy (3/6)
Misconception 3: Genome privacy should be left to bioinformaticians
• Specialists of bioinformatics are trained in both biology (including genetics) and computer science
• Hence they are better prepared than us (computer scientists) to address those problems
• What is wrong with this reasoning?
• Genome privacy requires a strong background in information security (threat analysis, protocol security, cryptography, …)
• Such a culture is well developed among computer scientists, notably thanks to the challenge of Internet security
• Yet, it is not part of the traditional background of bioinformaticians
• Learning the basics of genetics is pretty straightforward for computer scientists, see e.g. “Evolutionary Analysis” by Freeman and Herron, 5th edition, Pearson, 2013
Misconceptions about Genome Privacy (4/6)
Misconception 4: Genome privacy will be guaranteed by legislation
• The usage of genetic data is strictly regulated, see e.g. the Genetic Information Nondiscrimination Act (GINA), 2008, in the US
• Legislation will act as a deterrent
• What is wrong with this reasoning?
• If genomic data can be stealthily accessed, potential employers, bankers, and other decision makers will be tempted to make use of it (as recruiters do today by checking Facebook profiles of candidates)
• Organized criminals (who are rarely deterred by laws) can misuse those data in multiple ways (blackmailing, …)
Misconceptions about Genome Privacy (5/6)
Misconception 5: Privacy Enhancing Technologies are a nuisance in the case of genetics: genetic data should be made available online to everyone to facilitate research, as done e.g. in the case of the Personal Genome Project
• Medical progress is faster if (anonymized) medical records are freely available online
• What is wrong with this reasoning?
• Medical confidentiality is a crucial component of the trust between patient and healthcare provider
• If the population becomes scared about leakage of genomic data, a severe backlash on genomics research (and thus personalized medicine) could follow
Sometimes, medical researchers tend to underestimate the constraints of clinical practice…
Misconceptions about Genome Privacy (6/6)
Misconception 6: Encrypting genomic data is superfluous because it is hard to identify a person from her variants
• Databases of genomes are usually anonymized
• Even in clear text, genomic data are so complicated that it is practically impossible to deanonymize them
• What is wrong with this reasoning?
• See counter-examples hereafter
Examples of Recent Research Results
• Deanonymization of genomes
• Quantification of Kin Genomic Privacy
• Efficient and Secure Testing of Genomes
• Android-based GenoDroid Framework
• Privacy-Preserving Computation of Disease Risk by Using Genomic, Clinical, and Environmental Data
M. Gymrek, A. L. McGuire, D. Golan, E. Halperin, and Y. Erlich, “Identifying Personal Genomes by Surname Inference,” Science, Jan. 2013.
[Figure: Y chromosomes (Y) shown alongside the surname “Smith”, with www.ysearch.org as the lookup source]
Two Largest Public Genetic Genealogy Databases with Built-in Search Engines
• www.ysearch.org
• www.smgf.org (Sorenson Molecular Genealogy Foundation)
Publicly available: 135,000 surname/Y-STR records…
Y-STR: short tandem repeat on the Y chromosome (typically used in paternity tests)
What is the Likelihood to Recover a Surname?
Gymrek et al., “Identifying Personal Genomes by Surname Inference”
Empirical test on 900 surname/Y-STR haplotype records:
1. Y-STR of a real person
2. Querying Ysearch and SMGF
3. Calculating surname confidence score
4. Inferring surname
5. Comparing the predicted surname to the true one
For US Caucasian males from middle and upper class: 12% successful recoveries
→ Moral: Deanonymization of online genomic data is easy (and will become easier)
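The query-score-infer steps can be sketched as a nearest-profile lookup. This is a deliberately simplified stand-in: the scoring rule (fraction of matching markers), the threshold `min_score`, and the sample records are assumptions, and the confidence score actually used by Gymrek et al. is not described on the slide.

```python
def infer_surname(query, records, min_score=0.9):
    """Return (surname, score) for the best-matching Y-STR profile.

    query:   dict mapping marker name -> repeat count
    records: list of (surname, profile) pairs, profile being a dict like query
    The surname is None when no record reaches min_score (hypothetical threshold).
    """
    best_surname, best_score = None, 0.0
    for surname, profile in records:
        shared = [m for m in query if m in profile]
        if not shared:
            continue
        # Toy confidence score: share of common markers with equal repeat counts.
        score = sum(query[m] == profile[m] for m in shared) / len(shared)
        if score > best_score:
            best_surname, best_score = surname, score
    if best_score >= min_score:
        return best_surname, best_score
    return None, best_score


# Hypothetical genealogy records; repeat counts are made up for illustration.
records = [
    ("Smith", {"DYS19": 14, "DYS390": 24, "DYS391": 11}),
    ("Jones", {"DYS19": 15, "DYS390": 24, "DYS391": 10}),
]
print(infer_surname({"DYS19": 14, "DYS390": 24, "DYS391": 11}, records))
```

Returning no surname below the threshold mirrors the idea that only confident matches count as recoveries.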
Quantification of Kin Genomic Privacy (CCS 2013)
Correlated genetic information between family members ⇒ an individual sharing his/her genome threatens his/her (known) relatives' genomic privacy
Helping a Family to Decide what to Reveal
[Figure: framework diagram. Components shown: actual genomic sequences (relatives X1 … XN over M loci, e.g. AG CT AA GC AT … AC); GPPM; observed genomic sequences (same layout with some SNPs hidden, e.g. AG __ AA __ AT … __); adversary's background knowledge (familial relationships gathered from social networks or genealogy websites; linkage disequilibrium values as a matrix of pairwise joint probabilities Pij between SNP i and SNP j; MAF qi of SNP i; rules of meiosis); reconstruction attack (inference); genomic-privacy quantification; health-privacy quantification; decision.]
MAF: Minor Allele Frequency
GPPM: Genomic Privacy Protection Mechanism
Reconstruction Attacks
Matrix (N relatives × M loci) contains the probability distribution (of BB, Bb and bb) for known and unknown values of alleles. Initialization based on background knowledge.
The marginal probabilities for unknown values are computed by using sum-product (belief propagation) algorithms (next slide)
Linkage disequilibrium: given by a sparse pairwise joint probability matrix L where L_{i,j} = Pr(X_i, X_j)
m(k): mother's allele at SNP k
f(k): father's allele at SNP k
Factor Graph Example with a trio (3 individuals) and 3 SNPs in LD
[Figure: factor graph over mother (M), father (F) and child (C) with factor nodes f1 … f18; example messages at variable v1 are labelled mf1-v1, mf3-v1, mf10-v1, mf13-v1 and mv1-f1, mv1-f3, mv1-f10, mv1-f13, some of them shown as = 1]
Pedigree factor nodes: P(XC | XM, XF), assuming Mendelian inheritance
P(X1), P(X2), P(X3): given by population allele frequencies
LD factor nodes: P(X1, X2), P(X1, X3), P(X2, X3), joint probabilities given by LD
mf-v = messages from factors to variables = multiply the incoming messages from the other variables with the factor function and sum out every variable except the one to which the message is sent
mv-f = messages from variables to factors = multiply all incoming messages except the one coming from the factor to which mv-f is sent
M. Humbert, E. Ayday, J.-P. Hubaux, A. Telenti: Addressing the Concerns of the Lacks Family: Quantification of Kin Genomic Privacy, CCS 2013
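For a single SNP and a single trio, the marginals that belief propagation would return can be checked by brute-force enumeration. The sketch below does that instead of message passing, so it is not the authors' algorithm. It assumes genotypes coded as the number of minor alleles (BB = 0, Bb = 1, bb = 2) and Hardy-Weinberg priors for the parents derived from the minor allele frequency. Both choices, and all function names, are illustrative.

```python
from itertools import product

GENOTYPES = (0, 1, 2)   # number of minor alleles: BB, Bb, bb (assumed coding)


def prior(g, maf):
    """Hardy-Weinberg genotype prior from the minor allele frequency (assumption)."""
    q = maf
    return ((1 - q) ** 2, 2 * q * (1 - q), q ** 2)[g]


def child_given_parents(c, m, f):
    """Pedigree factor P(X_C = c | X_M = m, X_F = f) under Mendelian inheritance."""
    pm, pf = m / 2, f / 2          # chance each parent transmits the minor allele
    return ((1 - pm) * (1 - pf), pm * (1 - pf) + (1 - pm) * pf, pm * pf)[c]


def posterior(target, observed, maf):
    """Marginal of one trio member ('M', 'F' or 'C') given observed genotypes."""
    weights = [0.0, 0.0, 0.0]
    for m, f, c in product(GENOTYPES, repeat=3):
        assignment = {"M": m, "F": f, "C": c}
        if any(assignment[k] != v for k, v in observed.items()):
            continue               # inconsistent with what the adversary saw
        joint = prior(m, maf) * prior(f, maf) * child_given_parents(c, m, f)
        weights[assignment[target]] += joint
    total = sum(weights)
    return [w / total for w in weights]


# The child reveals genotype bb; what does that say about the mother?
print(posterior("M", {"C": 2}, maf=0.2))
```

Enumeration is exponential in the number of unknowns, which is why a message-passing scheme is needed once many relatives and SNPs in LD are involved.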
Efficient and Secure Testing of Genomes
Recent results [1] offer initial steps towards efficient and secure testing on whole genomes
– Privacy:
  • Individual retains control of own sequenced genome
  • Testing lab and individual perform a genomic test, with minimal mutual information disclosure:
    1. Only test outcome revealed to one or both parties
    2. Individual's genome remains private
    3. Lab keeps its test specifics private
  • Fast cryptographic protocols used for secure function evaluation
– Efficiency:
  • Maximizes pre-computation
  • Domain knowledge used to reduce “input size” to cryptographic layer (e.g., by emulating current in-vitro tests)
[1] P. Baldi, R. Baronio, E. De Cristofaro, P. Gasti, G. Tsudik. Countering GATTACA: Efficient and Secure Testing of Fully-Sequenced Human Genomes. CCS 2011.
[Figure: the individual (input: genome) and the doctor or lab (input: test specifics) run Secure Function Evaluation; each side is shown receiving the test result. Output reveals nothing beyond test result.]
Protocols:
• Private Set Intersection (PSI)
• Authorized PSI
• Cardinality-Only PSI
• […]
Tests:
• Paternity/Ancestry Testing
• Testing of SNPs/Markers
• Compatibility Testing
• […]
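To give a feel for cardinality-only PSI, here is a toy sketch based on commutative blinding by modular exponentiation: each side raises hashed items to its own secret exponent, and doubly blinded values match exactly when the underlying items match. This is a textbook-style illustration, not the protocol of [1]. The group, the hash-to-group mapping, and the parameter sizes are assumptions chosen for brevity and are not secure.

```python
import hashlib
import math
import random
import secrets

P = 2**127 - 1   # toy prime modulus; far too small and unstructured for real use


def hash_to_group(item):
    """Map a string to an integer in [1, P-1] (illustrative, not a vetted encoding)."""
    digest = hashlib.sha256(item.encode()).digest()
    return int.from_bytes(digest, "big") % (P - 1) + 1


def new_secret():
    """Random exponent coprime to P-1, so blinding is a bijection."""
    while True:
        k = secrets.randbelow(P - 3) + 2
        if math.gcd(k, P - 1) == 1:
            return k


def blind(items, key):
    return [pow(hash_to_group(x), key, P) for x in items]


def reblind(values, key):
    out = [pow(v, key, P) for v in values]
    random.shuffle(out)            # hide which of the client's items matched
    return out


def psi_cardinality(client_items, server_items):
    """Simulate both parties locally; inputs are assumed to be duplicate-free."""
    a, b = new_secret(), new_secret()
    client_once = blind(client_items, a)          # client -> server
    client_twice = reblind(client_once, b)        # server -> client, shuffled
    server_once = blind(server_items, b)          # server -> client
    server_twice = {pow(v, a, P) for v in server_once}
    return sum(1 for v in client_twice if v in server_twice)


# Hypothetical marker labels standing in for genome and test inputs.
print(psi_cardinality(["rs123:A", "rs456:G", "rs789:T"],
                      ["rs456:G", "rs789:T", "rs000:C"]))
```

Only the size of the overlap is learned by the client in this flow, which matches the idea that the output should reveal nothing beyond the test result.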
Android-based GenoDroid Framework [2]
[Figure: processing stages Data Conversion, Test Dependent Processing, Cryptographic Pre-processing, Secure Computation, and Communication & Pairing, distributed across Sequencing Center, Desktop and Smartphone; one part is marked “Only done once”]
[2] E. De Cristofaro, S. Faber, P. Gasti, G. Tsudik. GenoDroid: Are Privacy-Preserving Genomic Tests Ready for Prime Time? WPES 2012.
Privacy-Preserving Computation of Disease Risk by Using Genomic, Clinical, and Environmental Data
Presented yesterday at HealthTech by E. Ayday
• Protect the privacy of users' genomic, clinical and environmental medical data at a centralized biobank.
• Protect the privacy of the medical unit's confidential data.
• Allow medical units to perform some computations on the encrypted data in a privacy-preserving fashion.
• Allow specialists to access only the genomic data they need (or are authorized for).
Proposed Solution
[Figure: architecture with four parties: PATIENT (P), CERTIFIED INSTITUTION (CI), STORAGE AND PROCESSING UNIT (SPU) and MEDICAL UNIT (MU). Labelled flows: (i) DNA sample; (i) clinical and environmental data; (i) encrypted clinical and environmental data; (ii) encrypted SNPs; (iii) disease risk computation.]
E. Ayday, J. L. Raisaro, P. J. McLaren, J. Fellay, and J.-P. Hubaux. Privacy-Preserving Computation of Disease Risk by Using Genomic, Clinical, and Environmental Data. USENIX Security Workshop on Health Information Technologies (HealthTech '13)
See also poster at the main conference: “Towards Quantifying and Preventing the Leakage of Genomic Data Using Privacy-Enhancing Technologies”
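One way to compute on encrypted SNPs is an additively homomorphic cryptosystem. The sketch below uses textbook Paillier with tiny hard-coded primes to evaluate a weighted sum of genotypes without decrypting them. The choice of plain Paillier, the linear risk model, the integer weights, and every function name are assumptions made for illustration. The slide does not specify the scheme, and the key size here offers no security.

```python
import math
import secrets


def keygen(p=1009, q=1013):
    """Toy Paillier key pair; the primes are far too small for real use."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)   # lcm(p-1, q-1)
    mu = pow(lam, -1, n)                                # valid because g = n + 1
    return n, (lam, mu)


def encrypt(n, m):
    n2 = n * n
    while True:
        r = secrets.randbelow(n - 1) + 1
        if math.gcd(r, n) == 1:
            break
    return (1 + m * n) % n2 * pow(r, n, n2) % n2


def decrypt(n, private_key, c):
    lam, mu = private_key
    u = pow(c, lam, n * n)
    return (u - 1) // n * mu % n


def encrypted_risk(n, enc_genotypes, weights):
    """Compute Enc(sum(w_i * g_i)) from Enc(g_i) without the private key."""
    n2 = n * n
    acc = encrypt(n, 0)
    for c, w in zip(enc_genotypes, weights):
        acc = acc * pow(c, w, n2) % n2    # Enc(g)^w = Enc(w*g); product adds plaintexts
    return acc


n, private_key = keygen()
genotypes = [0, 1, 2, 1]                  # hypothetical risk-allele counts per SNP
weights = [3, 5, 2, 7]                    # hypothetical integer risk weights
ciphertexts = [encrypt(n, g) for g in genotypes]
print(decrypt(n, private_key, encrypted_risk(n, ciphertexts, weights)))
```

In this sketch the party holding the ciphertexts never sees a genotype, and the party holding the private key sees only the aggregate score.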
Conclusion
• Digital medicine is coming
• It will forever change the landscape of privacy protection
• Genomics is particularly relevant; ongoing huge research effort
• Highly sensitive data + huge amounts of data + complex correlations between data → Complex field, Big Data
• Very few researchers have addressed the topic of genome privacy → much more needs to be done in this field!!
• More information and pointers:
  http://sprout.ics.uci.edu/projects/privacy-dna/
  http://lca.epfl.ch/projects/genomic-privacy/
• Survey of the topic (available on the latter website): E. Ayday, E. De Cristofaro, J.-P. Hubaux, G. Tsudik, “The Chills and Thrills of Whole Genome Sequencing”, EPFL-REPORT-186866, June 2013