Do not reproduce without permission. Gerstein.info/talks
Access to Scientific Knowledge (A2sciK) - Practical Issues Relating to it for Scientists
Mark B Gerstein
Yale (CBB, MBB, CS)
A2K at Yale Law
2006.04.22, 15' in 11:15-13:00
• Human Genome Analysis as Paradigm of Modern Database Science
  "Coopetitive" endeavor
  Aspects of Distributed Annotation via Intricate DB Interoperation
  1) Intimate synchronization between sites
  2) Statistical integration of entire datasets
  3) Blurring of DBs & Journals
• Social Impediments to Database Interoperation
  1) Vast security costs in the lawless "Wild West" Internet
  2) Clashing cultures: pay-for-use academic publishing vs. open-source genomics
  3) In absence of social framework, using technology to "protect" information
Human Genome Project: Database Science
• Signature large-scale science project
  Many scientists contributing "facts" to a distributed collection of DBs
  Mathematical analysis and annotation on this
• Fusion of Information & Life
  Genome as the digital source code for humans
  Bioinformatics
  • Using computation to understand the genome
  • Bioscience + CS combination
• Social Framework
  Cooperative
  • International Teams with Altruistic Spirit
  • Belief in Data Sharing and Open Software
  Competitive
  • Desire for "credit" and profit
Rapid growth in DBs in science spurring on DB science
DB Interoperation & Federated Information Architecture
• Annotation of the human genome involves a massive federation of interoperating servers
  "Administered" by many disparate people and groups
• To find everything associated with a particular gene, perform a distributed query over many sites
  Conventional web links
  More complex interfaces
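The distributed query described on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the three "sites" are in-memory dictionaries standing in for remote servers, and the site names, gene identifiers, and record fields are all hypothetical rather than taken from any real database.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for remote annotation servers. A real federation
# would contact each site over the network instead of reading a dict.
SITES = {
    "sequence_hub": {"GENE1": {"chrom": "chr1", "start": 1000, "end": 5000}},
    "structure_db": {"GENE1": {"structure_id": "XYZ1"}},
    "motions_db": {"GENE2": {"motion": "hinge"}},
}


def query_site(site_name, gene):
    """Return (site_name, record); record is None if the site has no entry."""
    return site_name, SITES[site_name].get(gene)


def federated_query(gene):
    """Ask every site about one gene in parallel and merge the hits."""
    with ThreadPoolExecutor() as pool:
        results = pool.map(lambda name: query_site(name, gene), SITES)
        return {name: record for name, record in results if record is not None}


if __name__ == "__main__":
    print(federated_query("GENE1"))
```

Each site answers independently, so a site with no record for the gene simply drops out of the merged report. The hard parts named on the slide (disparate administrators, differing interfaces) are exactly what this toy version leaves out.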
Central Hub DBs
• Genome browser giving overview of whole genome (Google Earth)
• GenBank, UniProt, genome.ucsc.edu, PDB
• Unforeseen "power"
Specialized "Boutique" Databases
MolMovDB.org - Molecular detail about individual gene
Example motion: Maltose Binding Protein
Aspect #1: Intimate Synchronization between Sites, Propagating Dynamic Annotation
• Grappling with changing coordinates & annotation
• Complex dependencies between sites
[Figure labels: Annotation; Sequence 1, Sequence 2, Sequence 3; Genes A, Repeats 1, Genes B, Repeats 2]
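To make "changing coordinates" concrete, here is a minimal Python sketch of carrying annotation from one sequence version to the next. It assumes the mapping between versions is available as a list of aligned blocks. The block layout, feature names, and coordinates below are hypothetical, and this is not presented as the method used by any of the sites mentioned in the talk.

```python
from typing import NamedTuple, Optional, Tuple


class Block(NamedTuple):
    old_start: int  # inclusive start in the old sequence version
    old_end: int    # exclusive end in the old sequence version
    new_start: int  # position of old_start in the new sequence version


# Hypothetical mapping: the first 100 bases are unchanged, then 50 bases
# were inserted, so everything after position 100 shifts by +50.
BLOCKS = [Block(0, 100, 0), Block(100, 300, 150)]


def remap(start: int, end: int, blocks=BLOCKS) -> Optional[Tuple[int, int]]:
    """Map the interval [start, end) to new coordinates.

    Returns None when the interval does not fit inside a single block,
    i.e. the feature straddles a changed region and needs re-annotation.
    """
    for block in blocks:
        if block.old_start <= start and end <= block.old_end:
            shift = block.new_start - block.old_start
            return (start + shift, end + shift)
    return None


# Hypothetical annotation held by a downstream site, in old coordinates.
annotations = {"gene_A": (10, 60), "repeat_1": (120, 180), "gene_B": (90, 110)}
remapped = {name: remap(*span) for name, span in annotations.items()}

if __name__ == "__main__":
    print(remapped)
```

Features inside an unchanged block move cleanly. The feature that straddles the insertion comes back as None, which is the case where a dependent site cannot update automatically and the dependency between sites becomes a real cost.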
Dynamic Annotation: Ensembl 18.34
Dynamic Annotation: Sanger 2.3
Dynamic Annotation: Sanger 3.1b
Aspect #2: Journal Articles as "Annotation" -- Blurring the Boundaries between Papers and Databases
• How does traditional scientific publishing fit in?
  The journal article as DB annotation
• Towards reading literature with computers
  Mining text and correlating papers
  Bulk data in tables
• Towards interacting with DBs as journals
  Refereeing as QC
  Attribution for credit & accountability
  Timestamping of unchanging entries
  Citation and history
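One way to picture a database entry that behaves like a journal article is an append-only record: every deposit is attributed, timestamped, and never modified, so a citation always resolves to the same content. The Python sketch below is an assumption-laden illustration. The `Archive` class, the accession format, and the citation string are invented for this example and do not describe any existing database.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)  # frozen: a deposited version can never be edited
class EntryVersion:
    accession: str
    version: int
    content: str
    author: str
    timestamp: str  # ISO 8601, UTC
    digest: str     # SHA-256 of the content, lets a reader verify a citation


class Archive:
    """Append-only store: revisions add versions, they never overwrite."""

    def __init__(self):
        self._history = {}

    def deposit(self, accession, content, author):
        versions = self._history.setdefault(accession, [])
        entry = EntryVersion(
            accession=accession,
            version=len(versions) + 1,
            content=content,
            author=author,
            timestamp=datetime.now(timezone.utc).isoformat(),
            digest=hashlib.sha256(content.encode("utf-8")).hexdigest(),
        )
        versions.append(entry)
        return entry

    def cite(self, accession, version):
        """Build a citation string that pins one exact version."""
        e = self._history[accession][version - 1]
        return (f"{e.author}. {e.accession}.{e.version} "
                f"({e.timestamp[:10]}), sha256:{e.digest[:12]}")

    def history(self, accession):
        """Return all versions of an entry, oldest first."""
        return list(self._history[accession])


if __name__ == "__main__":
    archive = Archive()
    archive.deposit("ENTRY0001", "first annotation", "A Curator")
    archive.deposit("ENTRY0001", "revised annotation", "A Curator")
    print(archive.cite("ENTRY0001", 1))
```

Attribution and the timestamp supply credit and accountability, while the version list supplies the history. Refereeing, the quality-control step on the slide, is a social process that this sketch does not model.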
Aspect #3: Statistical Data Integration
• Example: predicting gene networks
  Combine many weak predictors of gene interactions into confident linkages (simple ex. is intersection)
  Where mining adds value
• Involves combining, in toto, disparate, heterogeneous information sources and computing a statistical union
  Download entire DBs and map one onto another
[TopNet.gersteinlab.org]
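The contrast between a plain intersection and a statistical combination can be shown with a small Python sketch. It uses a naive-Bayes-style sum of log likelihood ratios, which is one common way to pool weak evidence and is an assumption here, not necessarily the method behind the slide. The gene pairs, predictor names, likelihood ratios, and prior odds are all made-up illustrative values.

```python
import math

# Hypothetical evidence: which weak predictors fired for each gene pair.
EVIDENCE = {
    ("geneA", "geneB"): {"coexpression", "colocalization", "shared_function"},
    ("geneA", "geneC"): {"coexpression"},
    ("geneB", "geneD"): {"colocalization", "shared_function"},
}

# Hypothetical likelihood ratios P(fires | true link) / P(fires | no link).
# In practice these would be estimated against a gold-standard set.
LIKELIHOOD_RATIO = {
    "coexpression": 4.0,
    "colocalization": 5.0,
    "shared_function": 6.0,
}


def intersection(evidence):
    """Simplest integration: keep only pairs supported by every predictor."""
    everything = set(LIKELIHOOD_RATIO)
    return {pair for pair, fired in evidence.items() if fired == everything}


def log_odds(fired, prior_odds=0.05):
    """Prior log odds plus the log likelihood ratio of each predictor that fired.

    Simplification: a predictor that did not fire is treated as uninformative.
    """
    return math.log(prior_odds) + sum(math.log(LIKELIHOOD_RATIO[p]) for p in fired)


def confident_links(evidence, prior_odds=0.05):
    """Keep pairs whose combined posterior odds exceed 1 (log odds above 0)."""
    return {pair for pair, fired in evidence.items()
            if log_odds(fired, prior_odds) > 0.0}


if __name__ == "__main__":
    print("intersection:", intersection(EVIDENCE))
    print("statistical:", confident_links(EVIDENCE))
```

With these numbers the intersection keeps only the pair supported by all three predictors, while the statistical combination also accepts a pair supported by two strong predictors and still rejects the pair with a single weak hit. Computing such scores for every pair is why the approach needs whole datasets downloaded and mapped onto one another.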
Impediment #1: Vast Computer Security Costs in the "Wild West" Internet
Vast difficulty in securing information servers in academia
• Mundane administration: patches
• Makes building intricate systems for interoperation difficult, as researchers have to continually check their interfaces for "holes"
• Unique impact on research (vs business)
  Free and broad dissemination of ideas between labs and the public is the hallmark of research.
  Preserving openness precludes standard security practices often employed in a corporate or military environment -- e.g. private networks
  Academic computer users exhibit great variability, making effective security procedures more difficult
Impediment #2 -- Clashing cultures: pay-for-use academic publishing vs. open-source genomics
• Different traditions in academic publishing vs DB world
  Genome sequence is free, but you have to pay for the article about it!
• Many free text initiatives
  PubMedCentral.NIH.gov & arXiv.org
• Tricky economics of free text
  Potentially efficient, but redistributes dollars in the world of academic publishing
  Who pays: readers or writers?
Impediment #3: Absence of social framework for protecting "data"
• Researchers unclear on framework
  The ambiguity of the present copyright laws governing the protection of databases creates a situation where researchers are (practically) unclear about their rights to extract and combine data
  • Putting articles up on sites, "quoting" annotation
  Likewise, researchers are unsure how to get "credit" for combined data ("Mash ups")
  • Disincentive to data integration
• Database owners, unsure of how laws safeguard their information, overprotect their data with licenses and technological mechanisms that impede interoperation.
Technological safeguards to "protect" data
• Limits on Bulk Downloads & Global Analysis
  Passwords and IP filtering
  • allow the database owner to limit access to specific users and computers
  • selectively cut off access to researchers performing bulk calculations.
  Data can also be presented piecemeal, in response to a specific user query
  Examples
  • Incyte Proteome database
  • Cellzome database of interactions.
• Databases can be stored in proprietary formats
  Extreme is encryption
• Watermarking adds overt or hidden digital fingerprints
  Slightly corrupting the data.
  Not that common in bio-DBs (but found in British Library).
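For readers unfamiliar with how IP filtering and bulk cutoffs work mechanically, here is a minimal Python sketch of the first safeguard on this slide. Everything in it is an illustrative assumption: the `Gatekeeper` class, the allowed network (a reserved documentation range), and the three-query budget are invented for the example and do not describe how any named database operates.

```python
import ipaddress
from collections import Counter

# Hypothetical allowlist; 192.0.2.0/24 is a range reserved for documentation.
ALLOWED_NETWORKS = [ipaddress.ip_network("192.0.2.0/24")]
MAX_QUERIES = 3  # hypothetical per-client budget before access is cut off


class Gatekeeper:
    """Admit only allowlisted addresses, and only up to a query budget."""

    def __init__(self, allowed=ALLOWED_NETWORKS, max_queries=MAX_QUERIES):
        self.allowed = allowed
        self.max_queries = max_queries
        self.counts = Counter()

    def permit(self, client_ip: str) -> bool:
        addr = ipaddress.ip_address(client_ip)
        # IP filtering: refuse any computer outside the allowed networks.
        if not any(addr in net for net in self.allowed):
            return False
        # Bulk cutoff: count queries and refuse once the budget is spent.
        self.counts[client_ip] += 1
        return self.counts[client_ip] <= self.max_queries


if __name__ == "__main__":
    gate = Gatekeeper()
    print([gate.permit("192.0.2.7") for _ in range(4)])
    print(gate.permit("198.51.100.1"))
```

The same few lines that keep out strangers also stop a legitimate researcher at the fourth query, which is the point of the slide: a whole-database analysis is indistinguishable from abuse under this kind of safeguard.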
Acknowledgements
D Greenbaum, J Junker, S Douglas, A Smith, M Seringhaus
bioinfo.mbb.yale.edu
papers.gersteinlab.org/papers/epublishing