Ecosystem Challenges Around Data Use

Leonid Zhukov

Upload: leonid-zhukov

Post on 30-Oct-2014


Category: Science

DESCRIPTION

Presentation at the panel on data use at CRA 2014

TRANSCRIPT

Page 1: Ecosystem challenges around data use

Ecosystem Challenges Around Data Use

Leonid Zhukov

Page 2: Ecosystem challenges around data use

Ancestry.com

• World’s largest online family history resource

• Started as a publishing company in 1983, online from 1996

• 2.7 million worldwide subscribers

 

Page 3: Ecosystem challenges around data use

Data at Ancestry

• Historical records: company-acquired content collections

• User-created content:
  – Ancestor profiles and family trees
  – Uploaded photographs and stories

• User behavior data on Ancestry.com

• Customer DNA data

• 10 PB of structured and unstructured data

Page 4: Ecosystem challenges around data use

Historical records

• Historical content
  – 14 billion historical records going back to the 17th century
  – Digitized and searchable

Page 5: Ecosystem challenges around data use

Historical records

• More than 30,000 content collections

Page 6: Ecosystem challenges around data use

User family trees

• Family trees:
  – 60 million family trees
  – 6 billion profiles

Page 7: Ecosystem challenges around data use

Family trees

• Power-law distribution of tree sizes

[Figure labels: 500 nodes, 700 edges; 55 generations; time axis]

Page 8: Ecosystem challenges around data use

User contributed content

– 200 million uploaded family photos and stories

Page 9: Ecosystem challenges around data use

Person and record search

• Search query

Page 10: Ecosystem challenges around data use

Record linkage

• Record linkage: finding and matching records in multiple data sets with non-unique identifiers (data matching, entity disambiguation, duplicate detection, etc.)

• Goal: bring together information about the same person

• Some non-unique identifiers:
  – Names: first name, last name (John Smith: 300,000 records)
  – Dates: date of birth, date of death
  – Places: place of birth, residence, place of death
  – Extra: family members, life events

• Records are often incomplete and contain mistakes

• Other industries: banking, insurance, government, etc.
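The matching idea on this slide can be sketched in a few lines. This is only a minimal illustration, assuming fuzzy string similarity on names and places and a tolerance on birth year; the `Record` fields, the weights, and the 0.85 threshold are all hypothetical choices for the example, not anything stated in the talk.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Optional


@dataclass
class Record:
    # Hypothetical record layout for this sketch.
    first_name: str
    last_name: str
    birth_year: Optional[int] = None
    birth_place: Optional[str] = None


def text_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1], tolerant of small spelling differences."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def match_score(r1: Record, r2: Record) -> float:
    """Weighted average over the fields that both records actually contain."""
    parts = [
        (0.30, text_similarity(r1.first_name, r2.first_name)),
        (0.30, text_similarity(r1.last_name, r2.last_name)),
    ]
    if r1.birth_year is not None and r2.birth_year is not None:
        # Recorded dates are often slightly off, so allow a few years of slack.
        gap = abs(r1.birth_year - r2.birth_year)
        parts.append((0.25, max(0.0, 1.0 - gap / 5)))
    if r1.birth_place and r2.birth_place:
        parts.append((0.15, text_similarity(r1.birth_place, r2.birth_place)))
    total_weight = sum(weight for weight, _ in parts)
    return sum(weight * score for weight, score in parts) / total_weight


def is_same_person(r1: Record, r2: Record, threshold: float = 0.85) -> bool:
    """Declare a link when the combined score clears the (assumed) threshold."""
    return match_score(r1, r2) >= threshold
```

Two "John Smith" records with birth years 50 years apart and different birthplaces score well below the threshold here, while "Jon Smith, 1851, Ohio" still links to "John Smith, 1850, Ohio". Names alone are not enough, which is the point of the extra identifiers listed above.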

 

Page 11: Ecosystem challenges around data use

User behavior data

• User behavior data:
  – 75 million searches daily
  – 10 million profiles added daily
  – 3.5 million records attached daily

Page 12: Ecosystem challenges around data use

DNA Data

• Direct-to-consumer DNA test

• 700,000 SNPs per sample

• 400,000 DNA samples

• No medical studies

Page 13: Ecosystem challenges around data use

Ancestry DNA

• Genetic ethnicity
  – Reference panel
  – 26 ethnic regions, 3000 samples

Page 14: Ecosystem challenges around data use

Ancestry DNA

• Genetic inheritance
  – Identity-by-descent
  – Cousin matching

[Figure: Matching DNA]
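The slide names identity-by-descent without describing how it is detected. As a rough illustration only, and not the method used by Ancestry, one simple heuristic scans two unphased genotype vectors (coded as 0, 1, or 2 copies of an allele) and keeps long runs with no opposite homozygotes, since a 0 against a 2 rules out a shared haplotype at that site. The function name and the `min_len` parameter are assumptions of this sketch; production pipelines also handle phasing, genetic distance, and genotyping error.

```python
from typing import List, Sequence, Tuple


def candidate_shared_segments(
    g1: Sequence[int], g2: Sequence[int], min_len: int
) -> List[Tuple[int, int]]:
    """Return half-open index ranges [start, end) of at least `min_len`
    consecutive SNPs where the two genotype vectors are never opposite
    homozygotes (0 versus 2)."""
    if len(g1) != len(g2):
        raise ValueError("genotype vectors must have the same length")
    segments = []
    start = 0
    for i, (a, b) in enumerate(zip(g1, g2)):
        if abs(a - b) == 2:  # opposite homozygotes break the run
            if i - start >= min_len:
                segments.append((start, i))
            start = i + 1
    # Close the final run if it is long enough.
    if len(g1) - start >= min_len:
        segments.append((start, len(g1)))
    return segments
```

Longer compatible runs suggest closer relatives; a cousin-matching step would then rank pairs by the total length of such segments.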

Page 15: Ecosystem challenges around data use

DNA data: privacy and research

Interest in understanding how genetic variations influence heritable diseases and the response to medical treatments is intense. The academic community relies on the availability of public databases for the distribution of the DNA sequences and their variations. However, like other types of medical information, human genomic data are private, intimate, and sensitive. Genomic data have raised special concerns about discrimination, stigmatization, or loss of insurance or employment for individuals and their relatives (1, 2). Public dissemination of these data poses nonintuitive privacy challenges.

Unrelated persons differ in about 0.1% of the 3.2 billion bases in their genomes (3). Now, the most widely used forms of forensic identification rely on only 13 to 15 locations on the genome with variable repeats (4, 5). Single nucleotide polymorphisms (SNPs) contain information that can be used to identify individuals (5, 6). If someone has access to individual genetic data and performs matches to public SNP data, a small set of SNPs could lead to successful matching and identification of the individual. In such a case, the rest of the genotypic, phenotypic, and other information linked to that individual in public records would also become available.

The world population is roughly 10^10. Specifying DNA sequence at only 30 to 80 statistically independent SNP positions will uniquely define a single person (7). Furthermore, if some of those positions have SNPs that are relatively rare, the number that need to be tested is much smaller. If information about kinship exists, a few positions will confirm it. Thus, the transition from private to identifiable is very rapid (see the figure).
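The order of magnitude quoted above can be checked with a back-of-envelope calculation. The sketch below is not the authors' estimate (theirs is in the supporting online material); it assumes biallelic SNPs in Hardy-Weinberg proportions, the same minor allele frequency at every position, and full independence between positions. The function names are hypothetical.

```python
import math


def genotype_match_probability(minor_allele_freq: float) -> float:
    """Chance that two unrelated people share a genotype at one biallelic SNP,
    assuming Hardy-Weinberg genotype proportions."""
    p = minor_allele_freq
    genotype_freqs = [(1 - p) ** 2, 2 * p * (1 - p), p ** 2]
    return sum(f * f for f in genotype_freqs)


def snps_needed(minor_allele_freq: float, population: float = 1e10) -> int:
    """Smallest number of independent SNPs for which the expected number of
    chance matches across the population is at most one."""
    q = genotype_match_probability(minor_allele_freq)
    return math.ceil(math.log(population) / -math.log(q))
```

Under these assumptions `snps_needed(0.5)` gives 24 and `snps_needed(0.1)` gives 62, which is in line with the "30 to 80" range in the text: a few dozen independent positions are enough to single out one person among 10^10.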

Tension between the desire to protect privacy and the need to ensure access to scientific data has led to a search for new technologies. However, the hurdles may be greater than had been suspected. For example, one approach to protecting privacy is to limit the amount of high-quality data released and randomly to change a small percentage of SNPs for each subject in the database (8). Suppose that 10% of SNPs are randomly changed in a sequence of DNA, a fairly major obfuscation that would not please many genetics researchers. Our estimates (7) show that measuring as few as 75 statistically independent SNPs would define a small group that contained the real owner of the DNA. Disclosure control methods such as data suppression, data swapping, and adding noise would be unacceptable by similar arguments.

A second approach is to group SNPs into bins. Disregarding exact genomic locations of SNPs increases the number of records that share the same values, thus increasing confidentiality. Our calculations (7) show that such strategies do not protect privacy, because the pattern of binned values is unlikely to match anyone other than the owner of the DNA. Data analysis would be greatly complicated by binning, and the information content would be severely reduced or even eliminated.

Until technological innovations appear, solutions in policy and regulations must be found. We are building the Pharmacogenetics and Pharmacogenomics Knowledge Base (8, 9), which contains individual genotype data and associated phenotype information. No genetic data will be provided unless a user can demonstrate that he or she is associated with a bona fide academic, industrial, or governmental research unit and agrees to our usage policies (including audit of data access) (10). Although this does not prevent data abuse, it provides a way to monitor usage.

Social concerns about privacy are intricately connected to beliefs about benefits of research and trustworthiness of researchers and governmental agencies. In the United States, the Health Insurance Portability and Accountability Act of 1996 (HIPAA) and the associated Privacy Rules of 2003 (11) generally forbid sharing identifiable data without patient consent. However, they do not specifically address use or disclosure policies for human genetic data. Recent debates in Iceland, Estonia, Britain, and elsewhere (12–15), reveal a range of views on the threats posed by genetic information. The United States may be at one end of this spectrum, as its citizens seem to strongly desire health privacy. Whatever the setting, we recommend explicit clarifications to rules and legislation (such as HIPAA), so that they explicitly protect genetic privacy and set strong penalties for violations. These clarifications should define entities authorized to use and exchange human genetic data and for what purposes.

References and Notes

1. M. R. Anderlik, M. A. Rothstein, Annu. Rev. Genomics Hum. Genet. 2, 401 (2001).
2. P. Sankar, Annu. Rev. Med. 54, 393 (2003).
3. W. H. Li, L. A. Sadler, Genetics 129, 513 (1991).
4. L. Carey, L. Mitnik, Electrophoresis 23, 1386 (2002).
5. H. D. Cash et al., Pac. Symp. Biocomput. 2003, 638 (2003).
6. National Commission on the Future of DNA Evidence, The Future of Forensic DNA Testing: Predictions of the Research and Development Working Group (National Institute of Justice, U.S. Department of Justice, Washington, DC, 2000).
7. See supporting online material for further discussion.
8. L. C. R. J. Willenborg, T. D. Waal, Elements of Statistical Disclosure Control (Springer, New York, 2001).
9. T. E. Klein et al., Pharmacogenomics J. 1, 167 (2001).
10. www.pharmgkb.org/home/policies/index.jsp
11. Fed. Regist. 67, 53181 (2002).
12. R. Chadwick, BMJ 319, 441 (1999).
13. L. Frank, Science 290, 31 (2000).
14. M. A. Austin et al., Genet. Med. 5, 451 (2003).
15. V. Barbour, Lancet 361, 1734 (2003).
16. Supported in part by NIH/NLM Biomedical Informatics Training Grant LM007033 (Z.L.), NSF Grant DMS-0306612 (A.B.O.), and the NIH/NIGMS Pharmacogenetics Research Network and Database U01-GM61374 (R.B.A). We thank J. T. Chang, B. T. Naughton, T. E. Klein, and reviewers.

Supporting Online Material
www.sciencemag.org/cgi/content/full/305/5681/183/DC1

POLICY FORUM: GENETICS

Genomic Research and Human Subject Privacy

Zhen Lin,1 Art B. Owen,2 Russ B. Altman1*

1 Department of Genetics, Stanford University School of Medicine, CA 94305–5120, USA. 2 Department of Statistics, Stanford University, CA 94035–4065, USA.

*To whom correspondence should be addressed. E-mail: [email protected]

[Figure: Trade-offs between SNPs and privacy. Privacy level (low, medium, high) plotted against the number of independent SNPs (axis ticks: 5, 75, 100, 125, 1000, 2000, 3000, 4000). Region labels: insufficient for future genomic research; insufficient for privacy protection; needed to find genetic relationships.]

www.sciencemag.org SCIENCE VOL 305 9 JULY 2004


Z. Lin, A. Owen, R. Altman, Science, vol. 305, 2004

Page 16: Ecosystem challenges around data use

Challenges

• Engineering
  – Scalability
  – Availability
  – Security

• Research
  – Information retrieval
  – DNA genomic research

• Privacy