
1

Do not reproduce without permission. Gerstein.info/talks

Access to Scientific Knowledge (A2sciK) - Practical Issues Relating to it for Scientists

Mark B Gerstein, Yale (CBB, MBB, CS)

A2K at Yale Law

2006.04.22, 15' in 11:15-13:00

2


Access to Scientific Knowledge (A2sciK) - Practical Issues Relating to it for Scientists

• Human Genome Analysis as Paradigm of Modern Database Science
    "Coopetitive" endeavor
    Aspects of Distributed Annotation via Intricate DB Interoperation
      1) Intimate synchronization between sites
      2) Statistical integration of entire datasets
      3) Blurring of DBs & Journals

• Social Impediments to Database Interoperation
      1) Vast security costs in the lawless "Wild West" Internet
      2) Clashing cultures: pay-for-use academic publishing vs. open-source genomics
      3) In absence of social framework, using technology to "protect" information


4


Human Genome Project: Database Science

• Signature large-scale science project
    Many scientists contributing "facts" to a distributed collection of DBs
    Mathematical analysis and annotation on this

• Fusion of Information & Life
    Genome as the digital source code for humans
    Bioinformatics
      • Using computation to understand the genome
      • Bioscience + CS combination

• Social Framework
    Cooperative
      • International Teams with Altruistic Spirit
      • Belief in Data Sharing and Open Software
    Competitive
      • Desire for "credit" and profit

5


Rapid growth in DBs in science spurring on DB science

6


DB Interoperation & Federated Information Architecture

• Annotation of the human genome involves a massive federation of interoperating servers
    "Administered" by many disparate people and groups

• To find out all associated with a particular gene, perform a distributed query over many sites
    Conventional web links
    More complex interfaces
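The distributed query described on this slide can be illustrated with a minimal sketch. The site names, gene identifiers, and record fields below are hypothetical stand-ins (plain in-memory dictionaries), not the interfaces of any real genome database; a real federation would replace `query_site` with a network call to each independently administered server.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for independently administered sites.
# Each "site" is just a dictionary keyed by gene identifier.
SITES = {
    "sequence_db": {"GENE1": {"length_bp": 1200}},
    "structure_db": {"GENE1": {"fold": "alpha/beta"}},
    "motions_db": {"GENE2": {"motion": "hinge"}},
}


def query_site(site_name, gene_id):
    """Ask one site about a gene; returns (site_name, record or None)."""
    return site_name, SITES[site_name].get(gene_id)


def federated_query(gene_id):
    """Query every site in parallel and keep only the sites that answered."""
    with ThreadPoolExecutor() as pool:
        results = pool.map(lambda name: query_site(name, gene_id), SITES)
        return {name: record for name, record in results if record is not None}


if __name__ == "__main__":
    print(federated_query("GENE1"))
```

The caller gets one merged view per gene, while each site keeps control of its own records, which is the sense in which the federation is "administered" by many groups.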

7


Central Hub DBs

• Genome browser giving overview of whole genome (Google Earth)

• GenBank, UniProt, genome.ucsc.edu, PDB

• Unforeseen "power"

8


Specialized "Boutique" Databases

MolMovDB.org - Molecular detail about individual gene

9


Example motion: Maltose Binding Protein

10


Aspect #1: Intimate Synchronization between Sites, Propagating Dynamic Annotation

• Grappling with changing coordinates & annotation

• Complex dependencies between sites

[Figure labels: Annotation; Sequence 1, Sequence 2, Sequence 3; Genes A, Genes B; Repeats 1, Repeats 2]
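To make "changing coordinates" concrete, here is a minimal sketch of what a downstream site has to do when the underlying sequence is revised. It assumes the simplest possible revision, a single insertion, and 0-based half-open intervals; the feature names and positions are invented for illustration and do not come from the talk.

```python
def remap_feature(start, end, insert_pos, insert_len):
    """Map a feature [start, end) onto a revised sequence in which
    insert_len bases were inserted before old position insert_pos.

    Returns the new (start, end), or None when the insertion falls
    inside the feature, so the annotation can no longer be trusted.
    """
    if end <= insert_pos:
        return start, end  # entirely upstream: unchanged
    if start >= insert_pos:
        return start + insert_len, end + insert_len  # downstream: shifted
    return None  # insertion lands inside the feature


def remap_annotation(features, insert_pos, insert_len):
    """Remap every feature; collect the ones that need re-annotation."""
    remapped, stale = {}, []
    for name, (start, end) in features.items():
        new_coords = remap_feature(start, end, insert_pos, insert_len)
        if new_coords is None:
            stale.append(name)
        else:
            remapped[name] = new_coords
    return remapped, stale


if __name__ == "__main__":
    # Hypothetical annotation on the old sequence version.
    old = {"geneA": (100, 200), "repeat1": (300, 350), "geneB": (400, 600)}
    print(remap_annotation(old, insert_pos=320, insert_len=10))
```

Every site that stores coordinates against the old sequence has to repeat this step, which is one source of the complex dependencies between sites.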

11


Dynamic Annotation: Ensembl 18.34

12


Dynamic Annotation: Sanger 2.3

13


Dynamic Annotation: Sanger 3.1b

14


Aspect #2: Journal Articles as "Annotation" -- Blurring the Boundaries between Papers and Databases

• How does traditional scientific publishing fit in?
    The journal article as DB annotation

• Towards reading literature with computers
    Mining text and correlating papers
    Bulk data in tables

• Towards interacting with DBs as journals
    Refereeing as QC
    Attribution for credit & accountability
    Timestamping of unchanging entries
    Citation and history
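One way to picture "timestamping of unchanging entries" together with "citation and history" is an append-only store in which every deposit creates a new immutable version. This is only an illustrative sketch: the class names, the `accession.version` citation string, and the fields are assumptions made here, not a description of how any particular database works.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)  # frozen: a deposited version can never be edited
class EntryVersion:
    accession: str
    version: int
    content: str
    author: str          # attribution for credit and accountability
    timestamp: datetime  # when this version was deposited


class VersionedStore:
    """Append-only store: updates add versions, nothing is overwritten."""

    def __init__(self):
        self._history = {}

    def deposit(self, accession, content, author, timestamp=None):
        """Add a new version and return a stable citation string."""
        versions = self._history.setdefault(accession, [])
        entry = EntryVersion(
            accession=accession,
            version=len(versions) + 1,
            content=content,
            author=author,
            timestamp=timestamp or datetime.now(timezone.utc),
        )
        versions.append(entry)
        return f"{accession}.{entry.version}"

    def cite(self, citation):
        """Resolve a citation to the exact version that was cited."""
        accession, version = citation.rsplit(".", 1)
        return self._history[accession][int(version) - 1]

    def history(self, accession):
        """All versions of an entry, oldest first."""
        return list(self._history.get(accession, []))
```

Because a citation always resolves to the same content, a database entry handled this way can be referenced the way a journal article is.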

15


Aspect #3: Statistical Data Integration

• Example: predicting gene networks
    Combine many weak predictors of gene interactions into confident linkages (simple ex. is intersection)
    Where mining adds value

• Involves combining in toto disparate, heterogeneous information sources and computing statistical union
    Download entire DBs and map one onto another

[TopNet.gersteinlab.org]
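The following sketch contrasts the "simple" intersection with one possible statistical combination. The slide does not say which statistic is used, so the naive-Bayes-style product of likelihood ratios below is an assumption chosen for illustration; the predictor names, gene pairs, and scores are invented.

```python
# Hypothetical predictors: each maps a gene pair to a likelihood ratio
# (how much more likely the evidence is if the pair truly interacts).
PREDICTORS = {
    "coexpression": {("A", "B"): 3.0, ("A", "C"): 2.0},
    "colocalization": {("A", "B"): 4.0, ("B", "C"): 1.5},
    "shared_function": {("A", "B"): 2.0, ("A", "C"): 5.0},
}


def intersection(predictors):
    """Pairs supported by every predictor: strict, but discards a lot."""
    pair_sets = [set(scores) for scores in predictors.values()]
    return set.intersection(*pair_sets)


def combined_ratio(predictors):
    """Multiply likelihood ratios per pair, treating the predictors as
    independent (a naive Bayes assumption made for this sketch)."""
    combined = {}
    for scores in predictors.values():
        for pair, ratio in scores.items():
            combined[pair] = combined.get(pair, 1.0) * ratio
    return combined


def confident_links(predictors, threshold):
    """Pairs whose combined evidence reaches the threshold."""
    scores = combined_ratio(predictors)
    return {pair for pair, ratio in scores.items() if ratio >= threshold}


if __name__ == "__main__":
    print(intersection(PREDICTORS))
    print(confident_links(PREDICTORS, threshold=10.0))
```

With these toy numbers the intersection keeps only the pair seen by all three predictors, while the combined score also recovers a pair backed by two moderately strong predictors. Both steps need every predictor's full output at once, which is why this kind of integration depends on downloading entire DBs.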


17


Impediment #1: Vast Computer Security Costs in the "Wild West" Internet

18


Vast difficulty in securing information servers in academia

• Mundane administration: patches

• Makes building intricate systems for interoperation difficult, as researchers have to continually check their interfaces for "holes"

• Unique impact on research (vs business)
    Free and broad dissemination of ideas between labs and public is hallmark of research.
    Preserving openness precludes standard security practices often employed in a corporate or military environment -- e.g. private networks
    Academic computer users exhibit great variability, making effective security procedures more difficult

19


Impediment #2 -- Clashing cultures: pay-for-use academic publishing vs. open-source genomics

• Different traditions in academic publishing vs DB world
    Genome sequence is free but have to pay for article about it!

• Many free text initiatives
    PubMedCentral.NIH.gov & arXiv.org

• Tricky economics of free text
    Potentially efficient but redistributes dollars in world of academic publishing
    Who pays: readers or writers?

20


Impediment #3: Absence of social framework for protecting "data"

• Researchers unclear on framework
    The ambiguity of the present copyright laws governing the protection of databases creates a situation where researchers are (practically) unclear about their rights to extract and combine data
      • Putting articles up on sites, "quoting" annotation
    Likewise, researchers are unsure how to get "credit" for combined data ("mash-ups")
      • Disincentive to data integration

• Database owners, unsure of how laws safeguard their information, overprotect their data with licenses and technological mechanisms that impede interoperation.

21


Technological safeguards to "protect" data

• Limits on Bulk Downloads & Global Analysis
    Passwords and IP filtering
      • allow the database owner to limit access to specific users and computers
      • selectively cut off access to researchers performing bulk calculations.
    Data can also be presented piecemeal, in response to a specific user query
    Examples
      • Incyte Proteome database
      • Cellzome database of interactions.

• Databases can be stored in proprietary formats
    Extreme is encryption

• Watermarking adds overt or hidden digital fingerprints
    Slightly corrupting the data.
    Not that common in bio-DBs (but found in British Library).
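A minimal sketch can show why IP filtering combined with piecemeal, per-query access blocks global analysis. The class, the allow-listed network, and the quota below are hypothetical and are not taken from any of the databases named on the slide; the addresses come from ranges reserved for documentation.

```python
import ipaddress


class GatedDatabase:
    """Toy database that answers one key per query, only for allow-listed
    addresses, and stops answering after a per-client quota."""

    def __init__(self, records, allowed_networks, max_records_per_client):
        self.records = records
        self.allowed = [ipaddress.ip_network(n) for n in allowed_networks]
        self.max_records = max_records_per_client
        self.served = {}  # client address -> number of records served

    def lookup(self, client_ip, key):
        """Return one record, or raise PermissionError if access is cut off."""
        addr = ipaddress.ip_address(client_ip)
        if not any(addr in net for net in self.allowed):
            raise PermissionError("client address is not on the allow-list")
        count = self.served.get(client_ip, 0)
        if count >= self.max_records:
            raise PermissionError("per-client record quota exhausted")
        self.served[client_ip] = count + 1
        return self.records.get(key)


if __name__ == "__main__":
    db = GatedDatabase(
        records={"P1": "x", "P2": "y", "P3": "z"},
        allowed_networks=["192.0.2.0/24"],
        max_records_per_client=2,
    )
    print(db.lookup("192.0.2.10", "P1"))
    print(db.lookup("192.0.2.10", "P2"))
    try:
        db.lookup("192.0.2.10", "P3")
    except PermissionError as err:
        print("blocked:", err)
```

A single lookup works as usual, but a researcher who needs every record for a whole-database calculation hits the quota long before finishing.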


23


Acknowledgements

D Greenbaum, J Junker, S Douglas, A Smith, M Seringhaus

bioinfo.mbb.yale.edu

papers.gersteinlab.org/papers/epublishing
