do not reproduce without permission 1 gerstein.info/talks (c) 2008 a vision for using all the data...

56
Do not reproduce without permission 1 Gerstein.info/talks (c) 2008 A Vision for Using All the Data and Publications from Science on Web: Mining this to Study the Structure of Science Mark B Gerstein Yale (Comp. Bio. & Bioinformatics) NSF Workshop on Knowledge Management and Visualization Tools 2008.03.11, 09:30-10:00 Slides downloadable from Lectures.GersteinLab.org (Please read permissions statement.) (Textmining talk, fits into ~30' w. interrupting questions) 1 (c) Mark Gerstein, 2002, Yale, bioinfo.mbb.yale.edu

Upload: olivia-andrews

Post on 31-Dec-2015

216 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Do not reproduce without permission 1 Gerstein.info/talks (c) 2008 A Vision for Using All the Data and Publications from Science on Web: Mining this to

Do not reproduce without permission 1 G

ers

tein

.in

fo/t

alk

s

(c)

20

08

A Vision for Using All the Data and Publications from Science on Web:

Mining this to Study the Structure of Science

Mark B GersteinYale (Comp. Bio. & Bioinformatics)

NSF Workshop on Knowledge Management and Visualization Tools

2008.03.11, 09:30-10:00Slides downloadable from Lectures.GersteinLab.org

(Please read permissions statement.)

(Textmining talk, fits into ~30' w. interrupting questions)

1

(c

) M

ark

Ge

rste

in,

20

02

, Y

ale

, b

ioin

fo.m

bb

.ya

le.e

du

Page 2: Do not reproduce without permission 1 Gerstein.info/talks (c) 2008 A Vision for Using All the Data and Publications from Science on Web: Mining this to

Do not reproduce without permission 2 G

ers

tein

.in

fo/t

alk

s

(c)

20

08

Rapid growth in DBs in science,

changing the langscape

2

(c

) M

ark

Ge

rste

in,

20

02

, Y

ale

, b

ioin

fo.m

bb

.ya

le.e

du

[Gre

enba

um &

Ger

stei

n, N

at.

Bio

tech

. ('0

3)]

Page 3: Do not reproduce without permission 1 Gerstein.info/talks (c) 2008 A Vision for Using All the Data and Publications from Science on Web: Mining this to

Do not reproduce without permission 3 G

ers

tein

.in

fo/t

alk

s

(c)

20

08

Part of the Changing Landscape: Situation Facing DBs and Journals

• Distinctions Blurring Reading Journals via queries

• Reading DB entries Towards reading literature with computers

• Mining text and correlating papers

• Biology as a science of heterogeneous facts Well-suited to database storage

3

(c

) M

ark

Ge

rste

in,

20

02

, Y

ale

, b

ioin

fo.m

bb

.ya

le.e

du

[Gerstein, Bioinformatics ('99); Gerstein & Junker. Nature Yearbook ('02)]

Page 4: Do not reproduce without permission 1 Gerstein.info/talks (c) 2008 A Vision for Using All the Data and Publications from Science on Web: Mining this to

Do not reproduce without permission 4 G

ers

tein

.in

fo/t

alk

s

(c)

20

08

4

Conventional Challenge

• Hard to keep up with volume and growth of publications

• Missed opportunities in connections between fields• Harness the power of technology to help scientists

share information?

New Vision

• Discover new scientific relationships• Study the Structure of Science itself

Page 5: Do not reproduce without permission 1 Gerstein.info/talks (c) 2008 A Vision for Using All the Data and Publications from Science on Web: Mining this to

Do not reproduce without permission 5 G

ers

tein

.in

fo/t

alk

s

(c)

20

08

5

(c

) M

ark

Ge

rste

in,

20

02

, Y

ale

, b

ioin

fo.m

bb

.ya

le.e

du

[Rzhetsky et al, Cell ('08,

submitted)]

Overall Process of Web Mining

.

Fig. removed until paper in press

Page 6: Do not reproduce without permission 1 Gerstein.info/talks (c) 2008 A Vision for Using All the Data and Publications from Science on Web: Mining this to

Do not reproduce without permission 6 G

ers

tein

.in

fo/t

alk

s

(c)

20

08

6

(c

) M

ark

Ge

rste

in,

20

02

, Y

ale

, b

ioin

fo.m

bb

.ya

le.e

du

[Rzhetsky et al, Cell ('08,

submitted)]

Overall Process of Web Mining

Digest Texts into Simpler

Machine Understand-

able Form and then

Synthesize

Page 7: Do not reproduce without permission 1 Gerstein.info/talks (c) 2008 A Vision for Using All the Data and Publications from Science on Web: Mining this to

Do not reproduce without permission 7 G

ers

tein

.in

fo/t

alk

s

(c)

20

08

7

(c

) M

ark

Ge

rste

in,

20

02

, Y

ale

, b

ioin

fo.m

bb

.ya

le.e

du

[Rzhetsky et al, Cell ('08,

submitted)]

Overall Process of Web Mining

Doing better science: Finding new protein relationships (e.g.

protein interactions), looking for inconsist-encies in arguments, assembling consen-

sus definitions automatically

Krauthammer et al.Molecular triangulation: bridging linkage and molecular-network

information for identifying candidate

genes in Alzheimer's disease. PNAS ('04); Iossifov et al. Probabilistic inference of molecular networks from

noisy data sources.

Bioinformatics ('04)

Page 8: Do not reproduce without permission 1 Gerstein.info/talks (c) 2008 A Vision for Using All the Data and Publications from Science on Web: Mining this to

Do not reproduce without permission 8 G

ers

tein

.in

fo/t

alk

s

(c)

20

08

8

(c

) M

ark

Ge

rste

in,

20

02

, Y

ale

, b

ioin

fo.m

bb

.ya

le.e

du

[Rzhetsky et al, Cell ('08,

submitted)]

Overall Process of Web Mining

Mapping Science

+Studying its Dynamics &

Evolution

Page 9: Do not reproduce without permission 1 Gerstein.info/talks (c) 2008 A Vision for Using All the Data and Publications from Science on Web: Mining this to

Do not reproduce without permission 9 G

ers

tein

.in

fo/t

alk

s

(c)

20

08

9

(c

) M

ark

Ge

rste

in,

20

02

, Y

ale

, b

ioin

fo.m

bb

.ya

le.e

du

[Rzhetsky et al, Cell ('08,

submitted)]

Overall Process of Web Mining

• Revealing patterns of collaboration• Understanding basis of terms & nomenclature•Tracking the evolution of ideas• Models for the evolution of science; • Helping set policy & research directions

Page 10: Do not reproduce without permission 1 Gerstein.info/talks (c) 2008 A Vision for Using All the Data and Publications from Science on Web: Mining this to

Do not reproduce without permission 10

Ge

rste

in.i

nfo

/ta

lks

(c

) 2

00

81

0

(c

) M

ark

Ge

rste

in,

20

02

, Y

ale

, b

ioin

fo.m

bb

.ya

le.e

du

[Rzhetsky et al, Cell ('08,

submitted)]

Overall Process of Web Mining

Making it understand-

able (through “mashup”)

Page 11: Do not reproduce without permission 1 Gerstein.info/talks (c) 2008 A Vision for Using All the Data and Publications from Science on Web: Mining this to

Do not reproduce without permission 11

Ge

rste

in.i

nfo

/ta

lks

(c

) 2

00

8

Examples Illuminating Current State of Affairs:

Mining Simple Term Occurrence

Statistics to Understand and Justify Directions in Science

Page 12: Do not reproduce without permission 1 Gerstein.info/talks (c) 2008 A Vision for Using All the Data and Publications from Science on Web: Mining this to

Do not reproduce without permission 12

Ge

rste

in.i

nfo

/ta

lks

(c

) 2

00

8

Over-representation of crystallography among the Nobel Prizes, highlighted

by the 2006 Nobels

[Seringhaus & Gerstein, Science (2007)

Page 13: Do not reproduce without permission 1 Gerstein.info/talks (c) 2008 A Vision for Using All the Data and Publications from Science on Web: Mining this to

Do not reproduce without permission 13

Ge

rste

in.i

nfo

/ta

lks

(c

) 2

00

81

3

(c

) M

ark

Ge

rste

in,

20

02

, Y

ale

, b

ioin

fo.m

bb

.ya

le.e

du

The current state of mammalian gene annotation: a rationale for data driven research

p53

TNF

VEGF

EGFRERNFKB

TGFBIL-6

COX2BCL2

Adapted from Su and Hogenesch, Genome Biology, 2007

Page 14: Do not reproduce without permission 1 Gerstein.info/talks (c) 2008 A Vision for Using All the Data and Publications from Science on Web: Mining this to

Do not reproduce without permission 14

Ge

rste

in.i

nfo

/ta

lks

(c

) 2

00

8

Gene Name Skew

14

(

c)

Ma

rk G

ers

tein

, 2

00

2,

Ya

le,

bio

info

.mb

b.y

ale

.ed

u

[Seringhaus et al. GenomeBiology (2008)]

Page 15: Do not reproduce without permission 1 Gerstein.info/talks (c) 2008 A Vision for Using All the Data and Publications from Science on Web: Mining this to

Do not reproduce without permission 15

Ge

rste

in.i

nfo

/ta

lks

(c

) 2

00

8

Ex. Naming Issue: Starry Night

Starry night (P Adler, ’94)

15

(

c)

Ma

rk G

ers

tein

, 2

00

2,

Ya

le,

bio

info

.mb

b.y

ale

.ed

u

[Seringhaus et al. GenomeBiology (2008)]

Page 16: Do not reproduce without permission 1 Gerstein.info/talks (c) 2008 A Vision for Using All the Data and Publications from Science on Web: Mining this to

Do not reproduce without permission 16

Ge

rste

in.i

nfo

/ta

lks

(c

) 2

00

8

Naming Pathologies:

Related to Single Genes

16

(

c)

Ma

rk G

ers

tein

, 2

00

2,

Ya

le,

bio

info

.mb

b.y

ale

.ed

u

[Seringhaus et al. GenomeBiology (2008)]

(b) drop dead: flies with mutations in drop dead die rapidly after their brain rapidly deteriorates. (c) malvolio: gene needed for normal taste behaviour. Malvolio in Shakespeare's Twelfth Night tasted "with distempered appetite". (d) LOV: light, oxygen, or voltage (LOV) family of blue-light photoreceptor domains. (e) yuri: this gene was discovered on the anniversary of Yuri Gagarin's space flight. Mutants have problems with gravitaxis and cannot stay aloft. (f) tribbles: cells divide uncontrollably, like the eponymous Star Trek characters. (g) kuzbanian: mutants have uncontrollable bristle growth. Koozbanians are alien Muppets with uncontrollable hair growth; spelling was changed to avoid copyright infringement. (h) ring: really interesting new gene. (i) yippee: a graduate student’s reaction on cloning the gene

Page 17: Do not reproduce without permission 1 Gerstein.info/talks (c) 2008 A Vision for Using All the Data and Publications from Science on Web: Mining this to

Do not reproduce without permission 17

Ge

rste

in.i

nfo

/ta

lks

(c

) 2

00

8

Naming Pathologies:

Involving Multiple Gene Names

17

(

c)

Ma

rk G

ers

tein

, 2

00

2,

Ya

le,

bio

info

.mb

b.y

ale

.ed

u

[Seringhaus et al. GenomeBiology (2008)]

(j) kryptonite and superman: the kryptonite mutation suppresses the function of the SUPERMAN gene. (k) arleekin, valient, tungus: mutations in arleekin, valient, tungus and 29 other genes affect long-term memory. Named after Pavlov's dogs. (l) PKD1 (human) and lov-1 (worm): these are homologs, although their names do not suggest it. (m) MT-1: this label can refer to at least 11 different human genes. (n) BAF45 and BAF47: names for the same gene, reflecting a revision of the molecular weight of product.

Page 18: Do not reproduce without permission 1 Gerstein.info/talks (c) 2008 A Vision for Using All the Data and Publications from Science on Web: Mining this to

Do not reproduce without permission 18

Ge

rste

in.i

nfo

/ta

lks

(c

) 2

00

8

Examples Illuminating Current State of Affairs: Using Network Representations to Make Maps of Science -- Studying

the Publication Patterns of Genomics Consortia

18

(

c)

Ma

rk G

ers

tein

, 2

00

2,

Ya

le,

bio

info

.mb

b.y

ale

.ed

u

[Douglas et al. GenomeBiol. ('05), pubnet.gersteinlab.org]

Page 19: Do not reproduce without permission 1 Gerstein.info/talks (c) 2008 A Vision for Using All the Data and Publications from Science on Web: Mining this to

Do not reproduce without permission 19

Ge

rste

in.i

nfo

/ta

lks

(c

) 2

00

8

Co-Authorship Publication

Network of Struc. Genomics

Consortia (NESG)

19

(

c)

Ma

rk G

ers

tein

, 2

00

2,

Ya

le,

bio

info

.mb

b.y

ale

.ed

u

[Douglas et al. GenomeBiol. ('05), pubnet.gersteinlab.org]

Page 20: Do not reproduce without permission 1 Gerstein.info/talks (c) 2008 A Vision for Using All the Data and Publications from Science on Web: Mining this to

Do not reproduce without permission 20

Ge

rste

in.i

nfo

/ta

lks

(c

) 2

00

8

Co-authorship Networks

comparing the 9 NIH Structural Genomics

Centers

20

(

c)

Ma

rk G

ers

tein

, 2

00

2,

Ya

le,

bio

info

.mb

b.y

ale

.ed

u

(45)

(19)

(7)

AverageDegree

[Douglas et al. GenomeBiol. ('05),

pubnet.gersteinlab.org]

Page 21: Do not reproduce without permission 1 Gerstein.info/talks (c) 2008 A Vision for Using All the Data and Publications from Science on Web: Mining this to

Do not reproduce without permission 21

Ge

rste

in.i

nfo

/ta

lks

(c

) 2

00

8

Different Representations of Publication Network of NESG

21

(

c)

Ma

rk G

ers

tein

, 2

00

2,

Ya

le,

bio

info

.mb

b.y

ale

.ed

u

[Dou

glas

et

al.

Gen

omeB

iol.

('05)

, pu

bnet

.ger

ste

inla

b.or

g]

Page 22: Do not reproduce without permission 1 Gerstein.info/talks (c) 2008 A Vision for Using All the Data and Publications from Science on Web: Mining this to

Do not reproduce without permission 22

Ge

rste

in.i

nfo

/ta

lks

(c

) 2

00

8

Clustering structures determined

by struc. genomics consortia

according to functional similarity: Is there a functional

bias in consortia

structures?

22

(

c)

Ma

rk G

ers

tein

, 2

00

2,

Ya

le,

bio

info

.mb

b.y

ale

.ed

u

Avg. Degree

Avg. Path Clust. Coeff.

Diameter

PSI 24 2.6 37% 7

PDB 6 3.9 31% 9[Douglas et al. GenomeBiol. ('05), pubnet.gersteinlab.org]

Page 23: Do not reproduce without permission 1 Gerstein.info/talks (c) 2008 A Vision for Using All the Data and Publications from Science on Web: Mining this to

Do not reproduce without permission 23

Ge

rste

in.i

nfo

/ta

lks

(c

) 2

00

8

Making Larger Maps:

Mapping the whole field

of Melanoma Research

[Boyack, Kevin W., Mane, Ketan and Börner, Katy. (2004). Mapping Medline Papers, Genes, and Proteins Related to Melanoma Research. IV2004 Conference, London, UK, pp. 965-971. ]

Page 24: Do not reproduce without permission 1 Gerstein.info/talks (c) 2008 A Vision for Using All the Data and Publications from Science on Web: Mining this to

Do not reproduce without permission 24

Ge

rste

in.i

nfo

/ta

lks

(c

) 2

00

8

Backbone Map of Science & Soc. Sci.

[http://grants.nih.gov/grants/KM/OERRM/OER_KM_events/Borner.pdfBoyack, Kevin W., Klavans, R. and Börner, Katy. (2005). Mapping the Backbone of Science. Scientometrics. 64(3), 351-374.40]

Page 25: Do not reproduce without permission 1 Gerstein.info/talks (c) 2008 A Vision for Using All the Data and Publications from Science on Web: Mining this to

Do not reproduce without permission 25

Ge

rste

in.i

nfo

/ta

lks

(c

) 2

00

8

25

Ranking Journal Influence - Eigenfactor.org

“Ranks journals much as Google ranks websites.”

Adjusts for citation differences among disciplines

Ranking of journals in computer science (top of list).

Page 26: Do not reproduce without permission 1 Gerstein.info/talks (c) 2008 A Vision for Using All the Data and Publications from Science on Web: Mining this to

Do not reproduce without permission 26

Ge

rste

in.i

nfo

/ta

lks

(c

) 2

00

8

Examples Illuminating Current State of Affairs: Analyzing the Dynamics of Science

Page 27: Do not reproduce without permission 1 Gerstein.info/talks (c) 2008 A Vision for Using All the Data and Publications from Science on Web: Mining this to

Do not reproduce without permission 27

Ge

rste

in.i

nfo

/ta

lks

(c

) 2

00

8

'Omics terms over the years

27

(

c)

Ma

rk G

ers

tein

, 2

00

2,

Ya

le,

bio

info

.mb

b.y

ale

.ed

u

Pub

Med

Hits

Proteome

[Greenbaum et al. Gen. Res. ('01)]

Page 28: Do not reproduce without permission 1 Gerstein.info/talks (c) 2008 A Vision for Using All the Data and Publications from Science on Web: Mining this to

Do not reproduce without permission 28

Ge

rste

in.i

nfo

/ta

lks

(c

) 2

00

8

28

Evolution of Science

Map based on “bursty” words in life sciences publications since 1980.Older fundamental research (center) led to four different areas (subgraphs in corners).

This and other domain maps at http://www.scimaps.org.

•K. Mane, K. Börner (2004). Mapping topics and topic bursts in PNAS. 101 (Supplement 1): 5287.

Page 29: Do not reproduce without permission 1 Gerstein.info/talks (c) 2008 A Vision for Using All the Data and Publications from Science on Web: Mining this to

Do not reproduce without permission 29

Ge

rste

in.i

nfo

/ta

lks

(c

) 2

00

8

RNAi:Birth of a Field in

the Literature Culmin-ating in the 2006

Nobel

29

(

c)

Ma

rk G

ers

tein

, 2

00

2,

Ya

le,

bio

info

.mb

b.y

ale

.ed

u

Source:Gerstein & Douglas.

PLoS Comp. Bio. 3:e80 (2007)

PubNet.GersteinLab.org

Page 30: Do not reproduce without permission 1 Gerstein.info/talks (c) 2008 A Vision for Using All the Data and Publications from Science on Web: Mining this to

Do not reproduce without permission 30

Ge

rste

in.i

nfo

/ta

lks

(c

) 2

00

8

30

The Social Dynamics of Innovation and Scientific Discovery:

Making Models of Science

The population dynamics of authors in an emerging field is well described by models similar to those of epidemics, but that take into account contact processes and intentionality characteristic of human social dynamics. Panel (a) shows a SEIRZ model, (b) its best solution applied to the spread of Feynaman diagrams in the USA, Japan and the Soviet Union, and (c) details the model parameter's interpretation. The spread of ideas is characterized by relatively low contact rates (compared to infectious diseases), and very long lifetimes for the idea, as well as intentional strtuctures to promote interaction between individuals during the learning process.

The Patterns of Discovery and the Spread of Ideas as Epidemics.

[Bettencourt et al., Physica A (2006)]

Page 31: Do not reproduce without permission 1 Gerstein.info/talks (c) 2008 A Vision for Using All the Data and Publications from Science on Web: Mining this to

Do not reproduce without permission 31

Ge

rste

in.i

nfo

/ta

lks

(c

) 2

00

8

Examples Illuminating Current State of Affairs:

Mashing up the Text from Scientific Publications with other information

sources to make Science more Understandable

• Mashing up scientific texts with streamed video, genome annotation, protein structure & interactions

• SciVee http://www.scivee.tv Partnership: NSF, PloS, San Diego Supercomputing Center Pubcasts—video correlated with PLoS papers automatically displayed as video runs Videos—scientists upload their own without papers

• Journal of Visualized Experiments (JoVE) http://www.jove.com Monthly issues of theme-related videos Procedure walk-throughs, interviews High-quality video and sound

Page 32: Do not reproduce without permission 1 Gerstein.info/talks (c) 2008 A Vision for Using All the Data and Publications from Science on Web: Mining this to

Do not reproduce without permission 32

Ge

rste

in.i

nfo

/ta

lks

(c

) 2

00

8

32

SciVee

Page 33: Do not reproduce without permission 1 Gerstein.info/talks (c) 2008 A Vision for Using All the Data and Publications from Science on Web: Mining this to

Do not reproduce without permission 33

Ge

rste

in.i

nfo

/ta

lks

(c

) 2

00

8

33

JoVE

Page 34: Do not reproduce without permission 1 Gerstein.info/talks (c) 2008 A Vision for Using All the Data and Publications from Science on Web: Mining this to

Do not reproduce without permission 34

Ge

rste

in.i

nfo

/ta

lks

(c

) 2

00

8

Fusing Data & Papers to Annotate the Genome

• Ideal project for 21st century is annotating every base of the genome Want to attach all publications and results to the genome "Fly through Genome" as way to access and understand the

literature

• Problem of a good browser....

[Gerstein, Science ('00), Nature ('06)]

Page 35: Do not reproduce without permission 1 Gerstein.info/talks (c) 2008 A Vision for Using All the Data and Publications from Science on Web: Mining this to

Do not reproduce without permission 35

Ge

rste

in.i

nfo

/ta

lks

(c

) 2

00

8

We need a Google Earth for the Genome; A Step in this Direction...

[Herr, Holloway, Börner, Emergent Mosaic of Wikipedian Activity, 2007]

Page 36: Do not reproduce without permission 1 Gerstein.info/talks (c) 2008 A Vision for Using All the Data and Publications from Science on Web: Mining this to

Do not reproduce without permission 36

Ge

rste

in.i

nfo

/ta

lks

(c

) 2

00

8

Impediments to the Vision

36

(

c)

Ma

rk G

ers

tein

, 2

00

2,

Ya

le,

bio

info

.mb

b.y

ale

.ed

u

Page 37: Do not reproduce without permission 1 Gerstein.info/talks (c) 2008 A Vision for Using All the Data and Publications from Science on Web: Mining this to

Do not reproduce without permission 37

Ge

rste

in.i

nfo

/ta

lks

(c

) 2

00

8

DB Interoperation & Federated Information Architecture

• Need to perform a “distributed query” over many sites Conventional web links More complex interfaces

• Annotation of the human genome involves a massive federation of interoperating servers "Administered" by many

disparate people and groups

37

(

c)

Ma

rk G

ers

tein

, 2

00

2,

Ya

le,

bio

info

.mb

b.y

ale

.ed

u

[Smith et al., BMC Bioinfo. ('07)]

Page 38: Do not reproduce without permission 1 Gerstein.info/talks (c) 2008 A Vision for Using All the Data and Publications from Science on Web: Mining this to

Do not reproduce without permission 38

Ge

rste

in.i

nfo

/ta

lks

(c

) 2

00

8

Impediment #1: Structuring the

Information Correctly for Large-

scale Query

38

(

c)

Ma

rk G

ers

tein

, 2

00

2,

Ya

le,

bio

info

.mb

b.y

ale

.ed

u

Page 39: Do not reproduce without permission 1 Gerstein.info/talks (c) 2008 A Vision for Using All the Data and Publications from Science on Web: Mining this to

Do not reproduce without permission 39

Ge

rste

in.i

nfo

/ta

lks

(c

) 2

00

8

Structuring the Information

39

(

c)

Ma

rk G

ers

tein

, 2

00

2,

Ya

le,

bio

info

.mb

b.y

ale

.ed

u

Descriptive Name: Elongation Factor 2

Summary sentence describing function:

This protein promotes the GTP-dependent

translocation of the nascent protein chain from the A-site to the P-site of

the ribosome.

EF2_YEAST

Lots of references to papers

Curated DB entries in Uniprot vs Unstructured scientific text

Structured Semantic Web [Berners-Lee et al, Sci. Am. (2001)] vs purely unstructured text mining

[Smith & Gerstein, Science ('06), Tech Rev. (Jul. '07)]

Page 40: Do not reproduce without permission 1 Gerstein.info/talks (c) 2008 A Vision for Using All the Data and Publications from Science on Web: Mining this to

Do not reproduce without permission 40

Ge

rste

in.i

nfo

/ta

lks

(c

) 2

00

8

Other Issues with the Current Situtation between DBs & Journals

• Not always a clear linkage between papers & DBs Keeping entries in DB and paper in sync

• Data aliquot Huge datasets are handled but what of isolated facts

• How to connect key attributes of Journals with DBs Attribution for credit & accountability Time stamping of unchanging entries Citation and history Well worked out process of QC via refereeing and editing

• Readability of Papers Detailed data embedded into papers, making text hard to read

40

(

c)

Ma

rk G

ers

tein

, 2

00

2,

Ya

le,

bio

info

.mb

b.y

ale

.ed

u

[Gerstein, Bioinformatics ('99); Gerstein & Junker. Nature Yearbook ('02)]

Page 41: Do not reproduce without permission 1 Gerstein.info/talks (c) 2008 A Vision for Using All the Data and Publications from Science on Web: Mining this to

Do not reproduce without permission 41

Ge

rste

in.i

nfo

/ta

lks

(c

) 2

00

8

Structured Abstract Proposal as a Compromise

• Storing information in papers in machine interpretable fashion for automatic deposition into DBs Abstract + standardized view of all tables

• Cross-referencing it with a specific part of the global genome, proteome, and interactome Article written as annotation from the start

• Done in parallel to submission & revision of normal journal article Refereed & edited normally Capitalizes on peer review & incentives to publish

• Curators vs editors Author is in control and this process But it’s officiated by referees and editors

41

(

c)

Ma

rk G

ers

tein

, 2

00

2,

Ya

le,

bio

info

.mb

b.y

ale

.ed

u

[Seringhaus & Gerstein, FEBS ('08); Gerstein et al., Nature ('07)]

Page 42: Do not reproduce without permission 1 Gerstein.info/talks (c) 2008 A Vision for Using All the Data and Publications from Science on Web: Mining this to

Do not reproduce without permission 42

Ge

rste

in.i

nfo

/ta

lks

(c

) 2

00

8

Ex. Structured Abstract

42

(

c)

Ma

rk G

ers

tein

, 2

00

2,

Ya

le,

bio

info

.mb

b.y

ale

.ed

u

[Ser

ingh

aus

& G

erst

ein,

BM

C B

ioin

form

atic

s (2

007)

]

Page 43: Do not reproduce without permission 1 Gerstein.info/talks (c) 2008 A Vision for Using All the Data and Publications from Science on Web: Mining this to

Do not reproduce without permission 43

Ge

rste

in.i

nfo

/ta

lks

(c

) 2

00

8

Ex. Structured Abstract

• K.lactis (species) KlSTE4 (gene)

• KlSte4p (protein)– CLONED

» Available at …– SEQUENCED

» Sequence ATGTACGCTATAGGC….– MUTANTS

» DELETION» FUNCTIONAL ASSAYS» Sterile in both MATa and MATα» No defect in vegetative growth» STRAIN INFORMATION» Available at….

– INTERACTIONS» TWO-HYBRID» KlGpa1p (10x stronger) = XXX» Control (no partner) = XXX» KlGpa1p* = XXX» KlGpa2p = XXX» ScGpa1p = XXX (S. cerevisiae)

– COMMENTS» Both KlSte4p and KlGpa1p required to

induce mating in K.lactis

KlGPA1 (gene)• KlGpa1p (protein)

– INTERACTIONS» TWO-HYBRID» KlSte4 = XXX

• KlGpa1p* (protein)– INTERACTIONS

» TWO-HYBRID» KlSte4 = XXX

KlGPA2 (gene)• KlGpa2p (protein)

– INTERACTIONS» TWO-HYBRID» KlSte4 = XXX

• S.cerevisiae (species) SCGPA1 (gene)

• ScGpa1p (protein)– INTERACTIONS

» TWO-HYBRID» KlSte4 = XXX

43

(

c)

Ma

rk G

ers

tein

, 2

00

2,

Ya

le,

bio

info

.mb

b.y

ale

.ed

u

[Ser

ingh

aus

& G

erst

ein,

BM

C B

ioin

form

atic

s (2

007)

]

Page 44: Do not reproduce without permission 1 Gerstein.info/talks (c) 2008 A Vision for Using All the Data and Publications from Science on Web: Mining this to

Do not reproduce without permission 44

Ge

rste

in.i

nfo

/ta

lks

(c

) 2

00

8

Unsupervised Textmining vs Manually Curated and Structured

Documents: Not necessarily a conflict

• Structured abs. might be good training sets for mining

• Also, gateway to mining

44

(

c)

Ma

rk G

ers

tein

, 2

00

2,

Ya

le,

bio

info

.mb

b.y

ale

.ed

u

[Smith et al., Bioinformatics ('07)]

Page 45: Do not reproduce without permission 1 Gerstein.info/talks (c) 2008 A Vision for Using All the Data and Publications from Science on Web: Mining this to

Do not reproduce without permission 45

Ge

rste

in.i

nfo

/ta

lks

(c

) 2

00

8

Impediment #2: Access Restrictions Inhibit Large-scale

Query

45

(

c)

Ma

rk G

ers

tein

, 2

00

2,

Ya

le,

bio

info

.mb

b.y

ale

.ed

u

Page 46: Do not reproduce without permission 1 Gerstein.info/talks (c) 2008 A Vision for Using All the Data and Publications from Science on Web: Mining this to

Do not reproduce without permission 46

Ge

rste

in.i

nfo

/ta

lks

(c

) 2

00

8

Absence of social framework for protecting "data" on the web

• Researchers unclear on framework The ambiguity of the present copyright laws governing the

protection of databases creates a situation where researchers are (practically) unclear about their rights to extract and combine data• Putting articles up on sites, "quoting" annotation

Likewise, researchers are unsure how to get "credit" for combined data ("Mash ups")• Disincentive to data integration

• Information owners, unsure of how laws safeguards their information, overprotect their data with licenses and technological mechanisms that impede interoperation.

46

(

c)

Ma

rk G

ers

tein

, 2

00

2,

Ya

le,

bio

info

.mb

b.y

ale

.ed

u

[Greenbaum & Gerstein, Nat. Biotech. ('03)]

Page 47: Do not reproduce without permission 1 Gerstein.info/talks (c) 2008 A Vision for Using All the Data and Publications from Science on Web: Mining this to

Do not reproduce without permission 47

Ge

rste

in.i

nfo

/ta

lks

(c

) 2

00

8

Technological safeguards to "protect" data

• Limits on Bulk Downloads & Global Analysis Passwords and IP filtering

• allow the database owner to limit access to specific users and computers

• selectively cut off access to researchers performing bulk calculations.

Data can also be presented piecemeal, in response to a specific user query

Examples

• Incyte Proteome database

• Cellzome database of interactions.

• Databases can be stored in propriety formats Extreme is encryption

• Watermarking adds overt or hidden digital fingerprints Slightly corrupting the data. Not that common in bio-DBs

(but found in British Library).

47

(

c)

Ma

rk G

ers

tein

, 2

00

2,

Ya

le,

bio

info

.mb

b.y

ale

.ed

u

[Greenbaum & Gerstein, Nat. Biotech. ('03)]

Page 48: Do not reproduce without permission 1 Gerstein.info/talks (c) 2008 A Vision for Using All the Data and Publications from Science on Web: Mining this to

Do not reproduce without permission 48

Ge

rste

in.i

nfo

/ta

lks

(c

) 2

00

8

Free text Issue is Part of this Larger Context

• Different traditions in academic publishing vs DB world Genome sequence is free but have to pay for article about it!

• Many free text initiatives PubMedCentral.NIH.gov & arXiv.org

• Tricky economics of free text potentially efficient but redistributes dollars in world of academic publishing who pays: readers or writers

48

(

c)

Ma

rk G

ers

tein

, 2

00

2,

Ya

le,

bio

info

.mb

b.y

ale

.ed

u

[Greenbaum et al. (2003) Interdiscip Sci Rev 28: 293-302.]

Page 49: Do not reproduce without permission 1 Gerstein.info/talks (c) 2008 A Vision for Using All the Data and Publications from Science on Web: Mining this to

Do not reproduce without permission 49

Ge

rste

in.i

nfo

/ta

lks

(c

) 2

00

8

Impediment #3: Security

Considerations Inhibit Large-scale

Query

49

(

c)

Ma

rk G

ers

tein

, 2

00

2,

Ya

le,

bio

info

.mb

b.y

ale

.ed

u

Page 50: Do not reproduce without permission 1 Gerstein.info/talks (c) 2008 A Vision for Using All the Data and Publications from Science on Web: Mining this to

Do not reproduce without permission 50

Ge

rste

in.i

nfo

/ta

lks

(c

) 2

00

8

Vast Computer Security Costs in the "Wild West" Internet

50

(

c)

Ma

rk G

ers

tein

, 2

00

2,

Ya

le,

bio

info

.mb

b.y

ale

.ed

u

[Gre

en

ba

um

et

al.,

Na

t. B

iote

ch.

('04

); S

mith

et

al.,

Ge

no

me

Bio

l. ('0

5)

Page 51: Do not reproduce without permission 1 Gerstein.info/talks (c) 2008 A Vision for Using All the Data and Publications from Science on Web: Mining this to

Do not reproduce without permission 51

Ge

rste

in.i

nfo

/ta

lks

(c

) 2

00

8

Vast difficulty in securing information servers in academia

• Mundane administration — patches• Make building intricate systems for interoperation

difficult, as researchers have to continually check their interfaces for "holes"

• Unique impact on research (vs business) Free and broad dissemination of ideas between labs and

public is hallmark of research. Preserving openness precludes standard security practices

often employed in a corporate or military environment -- e.g. private networks

Academic computer users exhibit great variability, making effective security procedures more difficult

51

(

c)

Ma

rk G

ers

tein

, 2

00

2,

Ya

le,

bio

info

.mb

b.y

ale

.ed

u

[Gre

en

ba

um

et

al.,

Na

t. B

iote

ch.

('04

); S

mith

et

al.,

Ge

no

me

Bio

l. ('0

5)

Page 52: Do not reproduce without permission 1 Gerstein.info/talks (c) 2008 A Vision for Using All the Data and Publications from Science on Web: Mining this to

Do not reproduce without permission 52

Ge

rste

in.i

nfo

/ta

lks

(c

) 2

00

8

A Vision for Harnessing the Volume of Information on the Web to Study the Structure of Science

• Main Applications of Large-scale Mining New Scientific Discoveries

(not disc. here) Understanding Areas of Study through

Simple Zipf Stats

• Crystallography Nobel, Genomics, Gene Naming

Maps of Science

• Studying a genomics consortia, Bigger Map, Ranking Journals

Dynamics of Science

• Watching and modeling the appearance of new terms, RNAi ex.

• Impediments Large-scale Mining (as Distributed Query) (Semi) Structuring

the Information in Journals

Overcoming access restrictions

Security Considerations

Page 53: Do not reproduce without permission 1 Gerstein.info/talks (c) 2008 A Vision for Using All the Data and Publications from Science on Web: Mining this to

Do not reproduce without permission 53

Ge

rste

in.i

nfo

/ta

lks

(c

) 2

00

85

3

(c

) M

ark

Ge

rste

in,

20

02

, Y

ale

, b

ioin

fo.m

bb

.ya

le.e

du

MG

MS

AcknowledgementsAcknowledgements

TopNet.GersteinLab.org

Page 54: Do not reproduce without permission 1 Gerstein.info/talks (c) 2008 A Vision for Using All the Data and Publications from Science on Web: Mining this to

Do not reproduce without permission 54

Ge

rste

in.i

nfo

/ta

lks

(c

) 2

00

85

4

(c

) M

ark

Ge

rste

in,

20

02

, Y

ale

, b

ioin

fo.m

bb

.ya

le.e

du

Acknowledgements

TopNet.GersteinLab.org

MS

MG

Page 55: Do not reproduce without permission 1 Gerstein.info/talks (c) 2008 A Vision for Using All the Data and Publications from Science on Web: Mining this to

Do not reproduce without permission 55

Ge

rste

in.i

nfo

/ta

lks

(c

) 2

00

85

5

(c

) M

ark

Ge

rste

in,

20

02

, Y

ale

, b

ioin

fo.m

bb

.ya

le.e

du

Acknowledgements

TopNet.GersteinLab.org

MS

MG

Page 56: Do not reproduce without permission 1 Gerstein.info/talks (c) 2008 A Vision for Using All the Data and Publications from Science on Web: Mining this to

Do not reproduce without permission 56

Ge

rste

in.i

nfo

/ta

lks

(c

) 2

00

85

6

(c

) M

ark

Ge

rste

in,

20

02

, Y

ale

, b

ioin

fo.m

bb

.ya

le.e

du

Acknowledgements

TopNet.GersteinLab.org

MS

MG

A Smith

K Yip

K Cheung

S Douglas

NIH, NSF, Keck

Job opportunities currently for postdocs &

students

G Montelione

M Seringhaus

D Greenbaum

M Schultz

P Cayting