Do not reproduce without permission. Gerstein.info/talks

Access to Scientific Knowledge (A2sciK) - Practical Issues Relating to it for Scientists

Mark B Gerstein, Yale (CBB, MBB, CS)

A2K at Yale Law

2006.04.22, 15' in 11:15-13:00

Access to Scientific Knowledge (A2sciK) - Practical Issues Relating to it for Scientists

• Human Genome Analysis as Paradigm of Modern Database Science
  "Coopetitive" endeavor
  Aspects of Distributed Annotation via Intricate DB Interoperation
    1) Intimate synchronization between sites
    2) Statistical integration of entire datasets
    3) Blurring of DBs & Journals

• Social Impediments to Database Interoperation
    1) Vast security costs in the lawless "Wild West" Internet
    2) Clashing cultures: pay-for-use academic publishing vs. open-source genomics
    3) In absence of social framework, using technology to "protect" information

Human Genome Project: Database Science

• Signature large-scale science project
  Many scientists contributing "facts" to a distributed collection of DBs
  Mathematical analysis and annotation on this

• Fusion of Information & Life
  Genome as the digital source code for humans
  Bioinformatics
    • Using computation to understand the genome
    • Bioscience + CS combination

• Social Framework
  Cooperative
    • International Teams with Altruistic Spirit
    • Belief in Data Sharing and Open Software
  Competitive
    • Desire for "credit" and profit

Rapid growth in DBs in science spurring on DB science

DB Interoperation & Federated Information Architecture

• Annotation of the human genome involves a massive federation of interoperating servers
  "Administered" by many disparate people and groups

• To find out everything associated with a particular gene, perform a distributed query over many sites
  Conventional web links
  More complex interfaces
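
A distributed query of this kind can be sketched as a fan-out over sites followed by a merge of the answers. The sketch below is illustrative only: the site names, gene identifiers and record fields are hypothetical, and each "site" is an in-memory dictionary standing in for a remote server that would really be reached through web links or a richer interface.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for independently administered sites.
# Names, gene IDs and fields are invented for illustration.
SITES = {
    "sequence_hub": {"GENE1": {"chromosome": "chr1", "start": 1000, "end": 5000}},
    "structure_db": {"GENE1": {"structure_id": "S-001"}},
    "motion_db": {"GENE2": {"motion": "hinge"}},
}


def query_site(site_name, gene_id):
    """Ask one site for its record on gene_id; None if it has nothing."""
    return SITES[site_name].get(gene_id)


def federated_query(gene_id, site_names=None):
    """Send one gene lookup to every site and gather the hits by site name."""
    names = list(site_names or SITES)
    with ThreadPoolExecutor(max_workers=len(names)) as pool:
        answers = pool.map(lambda name: query_site(name, gene_id), names)
        # Keep only the sites that actually know this gene.
        return {name: hit for name, hit in zip(names, answers) if hit is not None}


if __name__ == "__main__":
    print(federated_query("GENE1"))
```

The merge step keeps each answer labeled with the site it came from, since the sites are run by different groups and their records may disagree.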

Central Hub DBs

• Genome browser giving overview of whole genome (analogous to Google Earth)

• GenBank, UniProt, genome.ucsc.edu, PDB

• Unforeseen "power"

Specialized "Boutique" Databases

MolMovDB.org - Molecular detail about individual gene

Example motion: Maltose Binding Protein

Aspect #1: Intimate Synchronization between Sites, Propagating Dynamic Annotation

• Grappling with changing coordinates & annotation

• Complex dependencies between sites

[Figure: annotation diagram with labels Sequence 1, Sequence 2, Sequence 3, Genes A, Repeats 1, Genes B, Repeats 2]
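
To make the "changing coordinates" problem concrete, here is a minimal sketch of how annotation anchored to one sequence version might be carried over to the next. It assumes the only change between versions is a single insertion, and it uses 0-based, half-open intervals; the function names and the example coordinates are hypothetical. Real remapping also has to handle deletions, rearrangements and multiple edits.

```python
def remap_interval(start, end, edit_pos, edit_len):
    """Carry a 0-based, half-open [start, end) feature across an insertion.

    edit_len bases were inserted at edit_pos in the new sequence version.
    Returns the feature in new coordinates, or None when the insertion
    falls strictly inside the feature and it must be re-annotated.
    """
    if edit_pos <= start:      # insertion upstream: the whole feature shifts
        return (start + edit_len, end + edit_len)
    if edit_pos >= end:        # insertion downstream: coordinates unchanged
        return (start, end)
    return None                # insertion disrupts the feature itself


def propagate(annotations, edit_pos, edit_len):
    """Remap every feature; report the ones that could not be carried over."""
    moved, broken = {}, []
    for name, (start, end) in annotations.items():
        new_interval = remap_interval(start, end, edit_pos, edit_len)
        if new_interval is None:
            broken.append(name)
        else:
            moved[name] = new_interval
    return moved, broken


if __name__ == "__main__":
    # Hypothetical features on the old sequence version.
    old = {"Genes A": (100, 200), "Repeats 1": (300, 350), "Genes B": (400, 500)}
    print(propagate(old, edit_pos=320, edit_len=10))
```

Every site that stores coordinates against the old version has to apply the same remapping, which is one source of the complex dependencies between sites.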

Dynamic Annotation: Ensembl 18.34

Dynamic Annotation: Sanger 2.3

Dynamic Annotation: Sanger 3.1b

Aspect #2: Journal Articles as "Annotation" -- Blurring the Boundaries between Papers and Databases

• How does traditional scientific publishing fit in?
  The journal article as DB annotation

• Towards reading literature with computers
  Mining text and correlating papers
  Bulk data in tables

• Towards interacting with DBs as journals
  Refereeing as QC
  Attribution for credit & accountability
  Timestamping of unchanging entries
  Citation and history
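
The journal-like properties listed above (attribution, timestamping of unchanging entries, citation and history) can be sketched as an append-only record. This is an illustrative data structure under assumed names, not a description of how any existing database stores its entries: `EntryVersion`, `EntryHistory`, the accession and the citation format are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class EntryVersion:
    """One immutable, attributed, timestamped version of a database entry."""
    accession: str
    version: int
    content: str
    contributor: str
    timestamp: datetime

    def citation(self):
        """A citable handle that pins the exact version (format is invented)."""
        return f"{self.contributor}. {self.accession}.{self.version} ({self.timestamp:%Y-%m-%d})"


class EntryHistory:
    """Append-only history: revisions add versions, nothing is overwritten."""

    def __init__(self, accession):
        self.accession = accession
        self._versions = []

    def revise(self, content, contributor, when=None):
        when = when or datetime.now(timezone.utc)
        version = EntryVersion(
            self.accession, len(self._versions) + 1, content, contributor, when
        )
        self._versions.append(version)
        return version

    def history(self):
        return tuple(self._versions)
```

Because each version is frozen, a citation to it keeps pointing at the same content even after the entry is revised.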

Aspect #3: Statistical Data Integration

• Example predicting gene networks
  Combine many weak predictors of gene interactions into confident linkages (simple ex. is intersection)
  Where mining adds value

• Involves combining in toto disparate, heterogeneous information sources and computing statistical union
  Download entire DBs and map one onto another

[TopNet.gersteinlab.org]
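
As a rough illustration of combining weak predictors, the sketch below contrasts the simple intersection with a naive-Bayes style score that sums log likelihood ratios. The naive-Bayes scoring is an assumption chosen for illustration, not a claim about the method behind any particular tool. The predictor names, gene pairs, likelihood ratios and prior odds are all made up, and a predictor that is silent about a pair is treated as contributing no evidence, which is a simplification.

```python
import math


def intersection(predictions):
    """Simplest integration: keep only the pairs every predictor reports."""
    sets = [set(p) for p in predictions]
    return set.intersection(*sets) if sets else set()


def combined_log_odds(pair, evidence, likelihood_ratios, prior_odds):
    """Add log(LR) for each predictor that supports the pair.

    Assumes predictors are conditionally independent; predictors that do
    not report the pair are ignored (a simplification).
    """
    log_odds = math.log(prior_odds)
    for name, pairs in evidence.items():
        if pair in pairs:
            log_odds += math.log(likelihood_ratios[name])
    return log_odds


def confident_links(evidence, likelihood_ratios, prior_odds, min_odds=1.0):
    """Return pairs whose combined posterior odds exceed min_odds."""
    candidates = set().union(*evidence.values())
    threshold = math.log(min_odds)
    return {
        pair for pair in candidates
        if combined_log_odds(pair, evidence, likelihood_ratios, prior_odds) > threshold
    }


if __name__ == "__main__":
    # Illustrative values only; pairs are stored as sorted tuples.
    evidence = {
        "coexpression": {("A", "B"), ("C", "D"), ("A", "C")},
        "two_hybrid": {("A", "B"), ("C", "D")},
        "localization": {("A", "B"), ("A", "C"), ("E", "F")},
    }
    ratios = {"coexpression": 20.0, "two_hybrid": 15.0, "localization": 2.0}
    print(intersection(evidence.values()))
    print(confident_links(evidence, ratios, prior_odds=0.01))
```

With these invented numbers, intersection keeps only the pair seen by all three predictors, while the weighted score also accepts a pair supported by the two stronger predictors alone. Either approach needs each source in full, which is why this kind of integration depends on downloading entire DBs.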

Impediment #1: Vast Computer Security Costs in the "Wild West" Internet

Vast difficulty in securing information servers in academia

• Mundane administration -- patches

• Makes building intricate systems for interoperation difficult, as researchers have to continually check their interfaces for "holes"

• Unique impact on research (vs business)
  Free and broad dissemination of ideas between labs and public is hallmark of research.
  Preserving openness precludes standard security practices often employed in a corporate or military environment -- e.g. private networks
  Academic computer users exhibit great variability, making effective security procedures more difficult

Impediment #2 -- Clashing cultures: pay-for-use academic publishing vs. open-source genomics

• Different traditions in academic publishing vs DB world
  Genome sequence is free but have to pay for article about it!

• Many free text initiatives
  PubMedCentral.NIH.gov & arXiv.org

• Tricky economics of free text
  potentially efficient but redistributes dollars in world of academic publishing
  who pays: readers or writers

Impediment #3: Absence of social framework for protecting "data"

• Researchers unclear on framework
  The ambiguity of the present copyright laws governing the protection of databases creates a situation where researchers are (practically) unclear about their rights to extract and combine data
    • Putting articles up on sites, "quoting" annotation
  Likewise, researchers are unsure how to get "credit" for combined data ("Mash ups")
    • Disincentive to data integration

• Database owners, unsure of how laws safeguard their information, overprotect their data with licenses and technological mechanisms that impede interoperation.

Technological safeguards to "protect" data

• Limits on Bulk Downloads & Global Analysis
  Passwords and IP filtering
    • allow the database owner to limit access to specific users and computers
    • selectively cut off access to researchers performing bulk calculations.
  Data can also be presented piecemeal, in response to a specific user query
  Examples
    • Incyte Proteome database
    • Cellzome database of interactions.

• Databases can be stored in proprietary formats
  Extreme is encryption

• Watermarking adds overt or hidden digital fingerprints
  Slightly corrupting the data.
  Not that common in bio-DBs (but found in British Library).

Acknowledgements

D Greenbaum, J Junker, S Douglas, A Smith, M Seringhaus

bioinfo.mbb.yale.edu
papers.gersteinlab.org/papers/epublishing