
NeSC, 25 April 2002 Why Don’t Scientists Use Databases?


Why Don’t Scientists Use Databases?

Peter Buneman
Division of Informatics
University of Edinburgh

Digital Libraries grant IIS 98-17444

(NSF,DARPA,NLM, LoC,NEH, NASA)

http://db.cis.upenn.edu

http://db.cis.upenn.edu/Research/provenance.html


Why Don’t Scientists Use Relational Databases Much?

Thanks to:
• The ontologists and astronomers at Edinburgh
• The database and bio-informatics groups at Penn
• Aleri Inc.

Special thanks (material stolen from)
• Sanjeev Khanna, Wang-Chiew, Keishi Tajima, Susan Davidson, Fidel Salas


Scientific Data is Ubiquitous

• 500 or so public molecular biology databases.
  – much discovery in silico
• Vast amounts of satellite imagery
  – maintaining it is very expensive
• Terabytes of astronomical data (not image data)
• Linguistic corpora are essential research tools -- also in terabytes


Relational DBs -- tabular data

Id   Name        Address
123  H. Simpson  Springfield
456  L. Simpson  Springfield
321  A. Jones    London

Id   Course      Grade
456  Geometry    A
123  Algorithms  D
456  Voltaire    A
321  Geometry    B
321  Algorithms  C

Title       Dept     Teacher
Algorithms  CompSci  Dr. Deadhead
Voltaire    French   Prof. lePew
Geometry    Math     Dr. Obtuse

• Useful information is obtained by combining tables.
• Efficient algorithms for
  – combining and indexing tables
  – transaction processing (updates and multiple users)
• Relational databases are a multi giga-$ industry
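To make "combining tables" concrete, here is a minimal sketch that loads the three tables from this slide into an in-memory SQLite database and joins them. The table names (student, enrolment, course) and column names are assumptions chosen for illustration; the slide does not name them.

```python
import sqlite3


def build_db():
    """Load the three slide tables into an in-memory SQLite database."""
    conn = sqlite3.connect(":memory:")
    # Table and column names are hypothetical; the slide only shows headers.
    conn.executescript("""
        CREATE TABLE student (id INTEGER PRIMARY KEY, name TEXT, address TEXT);
        CREATE TABLE enrolment (id INTEGER, course TEXT, grade TEXT);
        CREATE TABLE course (title TEXT PRIMARY KEY, dept TEXT, teacher TEXT);
    """)
    conn.executemany("INSERT INTO student VALUES (?, ?, ?)", [
        (123, "H. Simpson", "Springfield"),
        (456, "L. Simpson", "Springfield"),
        (321, "A. Jones", "London"),
    ])
    conn.executemany("INSERT INTO enrolment VALUES (?, ?, ?)", [
        (456, "Geometry", "A"),
        (123, "Algorithms", "D"),
        (456, "Voltaire", "A"),
        (321, "Geometry", "B"),
        (321, "Algorithms", "C"),
    ])
    conn.executemany("INSERT INTO course VALUES (?, ?, ?)", [
        ("Algorithms", "CompSci", "Dr. Deadhead"),
        ("Voltaire", "French", "Prof. lePew"),
        ("Geometry", "Math", "Dr. Obtuse"),
    ])
    return conn


def grades_with_teachers(conn):
    """Join all three tables: who got which grade from which teacher."""
    return conn.execute("""
        SELECT s.name, e.course, e.grade, c.teacher
        FROM student AS s
        JOIN enrolment AS e ON e.id = s.id
        JOIN course AS c ON c.title = e.course
        ORDER BY s.name, e.course
    """).fetchall()


if __name__ == "__main__":
    for row in grades_with_teachers(build_db()):
        print(row)
```

None of the three tables answers "which teacher gave L. Simpson an A?" on its own; the join does.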


Reasons for Mismatch

• Scientific data sets are too large (image data, huge analyses)
• Scientific data is too complex
• Relational databases don’t work well with arrays and scientific computation
• Schema evolution and history are important
• Databases are too expensive


Swissprot -- a curated database (1 of ~100,000 entries)

ID   11SB_CUCMA     STANDARD;      PRT;   480 AA.
AC   P13744;
DT   01-JAN-1990 (REL. 13, CREATED)
DT   01-JAN-1990 (REL. 13, LAST SEQUENCE UPDATE)
DT   01-NOV-1990 (REL. 16, LAST ANNOTATION UPDATE)
DE   11S GLOBULIN BETA SUBUNIT PRECURSOR.
OS   CUCURBITA MAXIMA (PUMPKIN) (WINTER SQUASH).
OC   EUKARYOTA; PLANTA; EMBRYOPHYTA; ANGIOSPERMAE; DICOTYLEDONEAE;
OC   VIOLALES; CUCURBITACEAE.
RN   [1]
RP   SEQUENCE FROM N.A.
RC   STRAIN=CV. KUROKAWA AMAKURI NANKIN;
RX   MEDLINE; 88166744.
RA   HAYASHI M., MORI H., NISHIMURA M., AKAZAWA T., HARA-NISHIMURA I.;
RL   EUR. J. BIOCHEM. 172:627-632(1988).
RN   [2]
RP   SEQUENCE OF 22-30 AND 297-302.
RA   OHMIYA M., HARA I., MASTUBARA H.;
RL   PLANT CELL PHYSIOL. 21:157-167(1980).
CC   -!- FUNCTION: THIS IS A SEED STORAGE PROTEIN.
CC   -!- SUBUNIT: HEXAMER; EACH SUBUNIT IS COMPOSED OF AN ACIDIC AND A
CC       BASIC CHAIN DERIVED FROM A SINGLE PRECURSOR AND LINKED BY A
CC       DISULFIDE BOND.
CC   -!- SIMILARITY: TO OTHER 11S SEED STORAGE PROTEINS (GLOBULINS).
DR   EMBL; M36407; G167492; -.
DR   PIR; S00366; FWPU1B.
DR   PROSITE; PS00305; 11S_SEED_STORAGE; 1.
KW   SEED STORAGE PROTEIN; SIGNAL.
FT   SIGNAL        1     21
FT   CHAIN        22    480       11S GLOBULIN BETA SUBUNIT.
FT   CHAIN        22    296       GAMMA CHAIN (ACIDIC).
FT   CHAIN       297    480       DELTA CHAIN (BASIC).
FT   MOD_RES      22     22       PYRROLIDONE CARBOXYLIC ACID.
FT   DISULFID    124    303       INTERCHAIN (GAMMA-DELTA) (POTENTIAL).
FT   CONFLICT     27     27       S -> E (IN REF. 2).
FT   CONFLICT     30     30       E -> S (IN REF. 2).
SQ   SEQUENCE   480 AA;  54625 MW;  D515DD6E CRC32;
     MARSSLFTFL CLAVFINGCL SQIEQQSPWE FQGSEVWQQH RYQSPRACRL ENLRAQDPVR
     RAEAEAIFTE VWDQDNDEFQ CAGVNMIRHT IRPKGLLLPG FSNAPKLIFV AQGFGIRGIA
     EAFQIDGGLV RKLKGEDDER DRIVQVDEDF EVLLPEKDEE ERSRGRYIES ESESENGLEE
     TICTLRLKQN IGRSVRADVF NPRGGRISTA NYHTLPILRQ VRLSAERGVL YSNAMVAPHY
     TVNSHSVMYA TRGNARVQVV DNFGQSVFDG EVREGQVLMI PQNFVVIKRA SDRGFEWIAF
     KTNDNAITNL LAGRVSQMRM LPLGVLSNMY RISREEAQRL KYGQQEMRVL SPGRSQGRRE
//

...
OS   CUCURBITA MAXIMA (PUMPKIN) (WINTER SQUASH).
OC   EUKARYOTA; PLANTA; EMBRYOPHYTA; ANGIOSPERMAE; DICOTYLEDONEAE;
OC   VIOLALES; CUCURBITACEAE.
RN   [1]
RP   SEQUENCE FROM N.A.
RC   STRAIN=CV. KUROKAWA AMAKURI NANKIN;
RX   MEDLINE; 88166744.
RA   HAYASHI M., MORI H., NISHIMURA M., AKAZAWA T., HARA-NISHIMURA I.;
RL   EUR. J. BIOCHEM. 172:627-632(1988).
...

     MARSSLFTFL CLAVFINGCL SQIEQQSPWE FQGSEVWQQH RYQSPRACRL ENLRAQDPVR
     RAEAEAIFTE VWDQDNDEFQ CAGVNMIRHT IRPKGLLLPG FSNAPKLIFV AQGFGIRGIA
     IPGCAETYQT DLRRSQSAGS AFKDQHQKIR PFREGDLLVV PAGVSHWMYN RGQSDLVLIV
     FADTRNVANQ IDPYLRKFYL AGRPEQVERG VEEWERSSRK GSSGEKSGNI FSGFADEFLE...

Data

Metadata

???


RN   [1]
RP   SEQUENCE FROM N.A.
RC   STRAIN=CV. KUROKAWA AMAKURI NANKIN;
RX   MEDLINE; 88166744.
RA   HAYASHI M., MORI H., NISHIMURA M., AKAZAWA T., HARA-NISHIMURA I.;
RL   EUR. J. BIOCHEM. 172:627-632(1988).
RN   [2]
RP   SEQUENCE OF 22-30 AND 297-302.
RA   OHMIYA M., HARA I., MASTUBARA H.;
RL   PLANT CELL PHYSIOL. 21:157-167(1980).

Hierarchical data. Order important.

DT   01-JAN-1990 (REL. 13, CREATED)
DT   01-JAN-1990 (REL. 13, LAST SEQUENCE UPDATE)
DT   01-NOV-1990 (REL. 16, LAST ANNOTATION UPDATE)

Record (inadequate) of history
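The hierarchy in the reference lines is implicit: each RN line opens a new reference, and the R* lines that follow belong to it until the next RN. A minimal sketch of recovering that nesting while keeping line order is given below. The function name and the list-of-pairs representation are assumptions for illustration, not part of any Swissprot tooling.

```python
REFERENCES = """\
RN   [1]
RP   SEQUENCE FROM N.A.
RC   STRAIN=CV. KUROKAWA AMAKURI NANKIN;
RX   MEDLINE; 88166744.
RA   HAYASHI M., MORI H., NISHIMURA M., AKAZAWA T., HARA-NISHIMURA I.;
RL   EUR. J. BIOCHEM. 172:627-632(1988).
RN   [2]
RP   SEQUENCE OF 22-30 AND 297-302.
RA   OHMIYA M., HARA I., MASTUBARA H.;
RL   PLANT CELL PHYSIOL. 21:157-167(1980).
"""


def parse_references(text):
    """Group flat R* lines into an ordered list of references.

    Each reference is a list of (line_code, value) pairs, so the order of
    lines inside a reference is preserved as well as the order of references.
    """
    refs = []
    for line in text.splitlines():
        if not line.strip():
            continue
        code, value = line[:2], line[2:].strip()
        if code == "RN":
            refs.append([(code, value)])      # RN starts a new reference
        elif refs:
            refs[-1].append((code, value))    # attach to the current one
    return refs


if __name__ == "__main__":
    for ref in parse_references(REFERENCES):
        print(ref[0][1], [code for code, _ in ref[1:]])
```

Note that the second reference has no RC or RX line, so a flat relational row per reference would need nullable columns, and the position of each reference in the list carries meaning ("IN REF. 2" elsewhere in the entry).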


FT   SIGNAL        1     21
FT   CHAIN        22    480       11S GLOBULIN BETA SUBUNIT.
FT   CHAIN        22    296       GAMMA CHAIN (ACIDIC).
FT   CHAIN       297    480       DELTA CHAIN (BASIC).
FT   MOD_RES      22     22       PYRROLIDONE CARBOXYLIC ACID.
FT   DISULFID    124    303       INTERCHAIN (GAMMA-DELTA) (POTENTIAL).
FT   CONFLICT     27     27       S -> E (IN REF. 2).
FT   CONFLICT     30     30       E -> S (IN REF. 2).

Array indices (array operations?)

OC   EUKARYOTA; PLANTA; EMBRYOPHYTA; ANGIOSPERMAE; DICOTYLEDONEAE;
OC   VIOLALES; CUCURBITACEAE.

Tree data (recursive query processing?)
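As an illustration of why tree data calls for recursive queries, the sketch below stores the OC lineage as (name, parent) rows and walks upward with a recursive common table expression in SQLite. The table name taxon and the adjacency-list encoding are assumptions for illustration; the talk does not prescribe a schema.

```python
import sqlite3

# Lineage taken from the OC lines of the entry, root first.
LINEAGE = [
    "EUKARYOTA", "PLANTA", "EMBRYOPHYTA", "ANGIOSPERMAE",
    "DICOTYLEDONEAE", "VIOLALES", "CUCURBITACEAE",
]


def build_taxonomy():
    """Store the lineage as an adjacency list: one (name, parent) row each."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE taxon (name TEXT PRIMARY KEY, parent TEXT)")
    parents = [None] + LINEAGE[:-1]
    conn.executemany("INSERT INTO taxon VALUES (?, ?)",
                     list(zip(LINEAGE, parents)))
    return conn


def ancestors(conn, name):
    """Return the path from the root down to `name`, found recursively."""
    rows = conn.execute("""
        WITH RECURSIVE up(name, parent, depth) AS (
            SELECT name, parent, 0 FROM taxon WHERE name = ?
            UNION ALL
            SELECT t.name, t.parent, up.depth + 1
            FROM taxon AS t JOIN up ON t.name = up.parent
        )
        SELECT name FROM up ORDER BY depth DESC
    """, (name,)).fetchall()
    return [r[0] for r in rows]


if __name__ == "__main__":
    print(ancestors(build_taxonomy(), "VIOLALES"))
```

A plain select-project-join query cannot express "all ancestors" for a tree of unknown depth; the recursion is what makes it possible.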


CC   -!- FUNCTION: THIS IS A SEED STORAGE PROTEIN.
CC   -!- SUBUNIT: HEXAMER; EACH SUBUNIT IS COMPOSED OF AN ACIDIC AND A
CC       BASIC CHAIN DERIVED FROM A SINGLE PRECURSOR AND LINKED BY A
CC       DISULFIDE BOND.
CC   -!- SIMILARITY: TO OTHER 11S SEED STORAGE PROTEINS (GLOBULINS).

Structure in comments = schema evolution


To turn Swissprot into tables requires:

• 20 - 30 tables
  – nothing extraordinary by relational standards, but
  – huge query to reconstruct original form
• Invented keys
• Queries on order and arrays
• Recursive query processing
• Also need to deal with schema evolution


Curated Databases

• Useful scientific databases are often curated: they are created/maintained with a great deal of “manual” labour.

select xyz
from pqr
where abc

Database people’s idea of what happens

What really happens

DB1 DB2


Database Inter-dependence is Complex

GERD

TRRD

GenBank

Swissprot

EpoDB

TransFac

GAIA

BEAD


Three new topics

• Annotation
  – how do I annotate a data element, and how is this passed through queries?
• Archiving
  – how do we keep all the old versions of a database?
• Vertical partitioning.
  – combining databases and vector processing.


Data Annotation (Khanna, Tan)

• Some databases (e.g. biology and linguistics) are designed to accommodate annotations
• Also a need for ad hoc (unanticipated) annotations.
  – How are annotations communicated?
  – How are they passed through queries?
• No general techniques or principles.


Sharing annotations (courtesy of Wang-Chiew Tan)

NYRestaurants (Source Table)

Restaurant          Cost  Type      Zip
Peacock Alley       $$$   French    10022
Bull & Bear         $$$   Seafood   10022
Pacifica            $     Chinese   10013
Soho Kitchen & Bar  $     American  10022

All Restaurants (View 1)

Restaurant          Cost  Type
Peacock Alley       $$$   French
Bull & Bear         $$$   Seafood
Pacifica            $     Chinese
Soho Kitchen & Bar  $     American

Cheap Restaurants (View 2)

Restaurant          Cost  Type
Pacifica            $     Chinese
Soho Kitchen & Bar  $     American

Annotations:
• Yummy chicken curry!!
• Serves fine French Cuisine in elegant setting. Jackets required.
• Extensive wine list!


Annotation looks simple but ...

• Computing how an annotation should move through a query is intractable

• Equivalent queries may not carry annotations in the same way

• New insights are needed!
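The second bullet can be illustrated with a toy model. In the sketch below every cell carries a set of notes, and a query propagates a note only when it copies the annotated cell into its output. That propagation rule, the Cell class, and the two-column relation R are all assumptions made for illustration; they are not the formal model used in the work by Khanna and Tan.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cell:
    """A value together with the annotations attached to it."""
    value: object
    notes: frozenset = frozenset()


# Hypothetical relation R(A, B); only the first A cell is annotated.
R = [
    {"A": Cell(1, frozenset({"checked by hand"})), "B": Cell(1)},
    {"A": Cell(2), "B": Cell(3)},
]


def q1(rows):
    """SELECT A FROM R WHERE A = B  (output copies the A cell)."""
    return [{"A": r["A"]} for r in rows if r["A"].value == r["B"].value]


def q2(rows):
    """SELECT B AS A FROM R WHERE A = B  (output copies the B cell)."""
    return [{"A": r["B"]} for r in rows if r["A"].value == r["B"].value]


def values(result):
    return [row["A"].value for row in result]


def notes(result):
    return [set(row["A"].notes) for row in result]


if __name__ == "__main__":
    print(values(q1(R)) == values(q2(R)))   # same answer as plain queries
    print(notes(q1(R)), notes(q2(R)))       # different annotations
```

Because the selection forces A = B, the two queries return identical values on every database, yet under this copy-based rule only the first carries the note forward.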


How do we Build Archival Databases? [Khanna, Tajima, Tan]

• Many scientific databases keep archives. It’s important to preserve the state of knowledge as it was in the past

• Archive frequently: space consuming

• Archive infrequently: delay in getting recent information published.


The dangers of electronic documents

http://www.ornl.gov/hgmis/publicat/miscpubs/bioinfo/inf_rep2.html#AppenI

Report of a DOE “bioinformatics summit” ca. 1994

Now:

APPENDIX I: SAMPLE QUESTIONS FOR A FEDERATED DATABASE
Continued HGP progress will depend in part upon the ability of genome databases to answer increasingly complex queries that span multiple community databases.

Some examples of such queries are given in this appendix. Note, however, until a fully atomized sequence database is available (i.e., no data stored in ASCII text fields), none of the queries in this appendix can be answered. ...

Then:

APPENDIX I: SAMPLE QUESTIONS FOR A FEDERATED DATABASE
Continued HGP progress will depend in part upon the ability of genome databases to answer increasingly complex queries that span multiple community databases.

Some examples of such queries are given in this appendix. Note, however, until a fully relationalized sequence database is available, none of the queries in this appendix can be answered. ...

(No archive/edition! No footnote!)


Examples from Bioinformatics

• Swissprot. New version produced every four months.
  – Old versions are kept.
  – Difficult to get at most recent data
• OMIM. New version produced every day
  – Old versions are not kept
  – Impossible to reconstruct past states of the data


Current approaches use “diff”

need to preserve “object continuity” through time

Version 1 (with line numbers):

 1  <DB>
 2    <Person>
 3      <Name>Joe</>
 4      <DateOfBirth>March</>
 5      <Address>South Street</>
 6      <Zip>12345</>
 7    </>
 8    <Person>
 9      <Name>Jane</>
10      <DateOfBirth>May</>
11      <Address>Pine Street</>
12      <Zip>67890</>
13    </>
14  </>

Version 2 (with line numbers):

 1  <DB>
 2    <Person>
 3      <Name>Jane</>
 4      <DateOfBirth>May</>
 5      <Address>South Street</>
 6      <Zip>12345</>
 7    </>
 8    <Person>
 9      <Name>Joe</>
10      <DateOfBirth>March</>
11      <Address>Pine Street</>
12      <Zip>67890</>
13    </>
14  </>

Output of line diff (versions 1-2):

3,4c
<Name>Jane</>
<DateOfBirth>May</>
9,10c
<Name>Joe</>
<DateOfBirth>March</>
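The contrast between a line diff and a diff that preserves object continuity can be sketched as follows. The records are the two versions from this slide; treating Name as the key is an assumption made for illustration, and difflib stands in for the Unix diff tool.

```python
import difflib

V1 = [
    {"Name": "Joe", "DateOfBirth": "March", "Address": "South Street", "Zip": "12345"},
    {"Name": "Jane", "DateOfBirth": "May", "Address": "Pine Street", "Zip": "67890"},
]
V2 = [
    {"Name": "Jane", "DateOfBirth": "May", "Address": "South Street", "Zip": "12345"},
    {"Name": "Joe", "DateOfBirth": "March", "Address": "Pine Street", "Zip": "67890"},
]


def to_lines(records):
    """Serialise records one element per line, as on the slide."""
    lines = ["<DB>"]
    for rec in records:
        lines.append("<Person>")
        lines += [f"<{tag}>{val}</>" for tag, val in rec.items()]
        lines.append("</>")
    lines.append("</>")
    return lines


def line_diff(old, new):
    """Return 1-based (first, last) ranges of old lines that changed."""
    sm = difflib.SequenceMatcher(a=to_lines(old), b=to_lines(new), autojunk=False)
    return [(i1 + 1, i2) for tag, i1, i2, _, _ in sm.get_opcodes() if tag != "equal"]


def keyed_diff(old, new, key="Name"):
    """Compare records that share a key value, field by field."""
    old_by_key = {r[key]: r for r in old}
    new_by_key = {r[key]: r for r in new}
    changes = {}
    for k in old_by_key.keys() & new_by_key.keys():
        delta = {f: (v, new_by_key[k].get(f))
                 for f, v in old_by_key[k].items() if v != new_by_key[k].get(f)}
        if delta:
            changes[k] = delta
    return changes


if __name__ == "__main__":
    print(line_diff(V1, V2))
    print(keyed_diff(V1, V2))
```

The line diff reports that the name and birth date inside each Person element were rewritten. With Name as the key, the same change reads as Joe and Jane each keeping their identity and swapping address and zip code.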


A Sequence of Versions


“Pushing” time down

[Driscoll, Sarnak, Sleator, Tarjan: “Making Data Structures Persistent.” ]
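One way to read "pushing time down" is to merge all versions into a single keyed structure and attach version numbers to the lowest elements that actually differ, so unchanged values are stored once. The sketch below follows that reading. The nested-dict layout, the function names, and the use of Name as key are assumptions for illustration; this is not the archiver described in the talk.

```python
V1 = [
    {"Name": "Joe", "DateOfBirth": "March", "Address": "South Street", "Zip": "12345"},
    {"Name": "Jane", "DateOfBirth": "May", "Address": "Pine Street", "Zip": "67890"},
]
V2 = [
    {"Name": "Jane", "DateOfBirth": "May", "Address": "South Street", "Zip": "12345"},
    {"Name": "Joe", "DateOfBirth": "March", "Address": "Pine Street", "Zip": "67890"},
]


def add_version(archive, version, records, key="Name"):
    """Merge one version into the archive.

    Layout: archive[key_value][field][value] = set of version numbers in
    which that field had that value.
    """
    for rec in records:
        entry = archive.setdefault(rec[key], {})
        for fld, value in rec.items():
            if fld != key:
                entry.setdefault(fld, {}).setdefault(value, set()).add(version)


def get_version(archive, version, key="Name"):
    """Rebuild the records of one version from the merged archive.

    Simplification: a record with no non-key fields would not be rebuilt.
    """
    out = []
    for key_value, fields in archive.items():
        rec = {fld: value
               for fld, values in fields.items()
               for value, versions in values.items()
               if version in versions}
        if rec:
            out.append({key: key_value, **rec})
    return out


def build_archive(versions):
    """Archive a list of versions, numbering them from 1."""
    archive = {}
    for number, records in enumerate(versions, start=1):
        add_version(archive, number, records)
    return archive


if __name__ == "__main__":
    archive = build_archive([V1, V2])
    print(archive["Joe"])
    print(get_version(archive, 1))
```

Joe's birth date is stored once with both version numbers, while his address appears twice, once per version. Retrieving any version is a single pass over the archive, whatever its age.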


Experimental Results (OMIM)

[Figure: size (bytes x 10^6) against number of versions. Curves: archive, inc diff, version, gzip(inc diff) (compressed inc diff), XMill(archive) (compressed archive).]

Uncompressed
• Archive size is
  – 1.01 times diff repository size
  – 1.04 times size of largest version

Compressed
• Archive size is between 0.94 and 1 times compressed diff repository size

• gzip - unix compression tool
• XMill - XML compression tool


Experimental Results (Swissprot)

[Figure: size (bytes x 10^6) against number of versions. Curves: archive, inc diff, version, gzip(inc diff) (compressed inc diff), XMill(archive) (compressed archive).]

Uncompressed
• Archive size is
  – 1.08 times diff repository size
  – 1.92 times size of largest version

Compressed
• Archive size is between 0.59 and 1 times compressed diff repository size

• gzip - unix compression tool
• XMill - XML compression tool


The Bottom Line

• We have built an archiver, using XML as the base format
• We can build a year of archives (archive as often as you like) for a 14% increase on the size of the most recent database
• Based on keys -- preserves object history
• Works well with compression
• Obtaining an old archive is no more expensive than getting the current version.


Vertical partitioning

• An old idea revisited
• Fusion of array processing languages and database query languages
• Substantial use on Wall Street!!!


Conventional Storage

Rows are stored contiguously. Order is not preserved.
(Horizontal partitioning)

disk pages


Problems with conventional storage

• Unanticipated queries will probably read the whole database

SELECT AVG(SQRT(shoe_size))
FROM employee
WHERE hat_size > shoe_size

(this only needs two fields)

• Order of rows is “random” and does not support order-sensitive functions: moving window averages, convolutions, etc.


Vertical partitioning (vectorisation)

Columns are stored contiguously. Order is preserved.
(Vertical partitioning)

disk pages


Advantages of Vertical Partitioning

• Faster queries.
  – A query that reads 2 columns in 100 does 2% of the i/o (i/o cost dominates)
  – A few columns can often reside in memory.
• Computation on order
• Can use both SQL and vector processing languages
• Downside: deletions are horribly expensive.
  – but deletions are uncommon in scientific DBs
• Vertical partitioning can also be performed on hierarchical structures -- like Swissprot -- and XML
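A minimal sketch of the idea, assuming a table held as one Python list per column: the hat-size/shoe-size query from the earlier slide touches only two of the columns, and an order-sensitive function such as a moving average falls out directly because each column keeps its order. The sample values and function names are invented for illustration.

```python
import math

# One list per column; position i in every list belongs to the same row.
# All values are made-up sample data.
employee = {
    "name": ["Ann", "Bob", "Cy", "Di"],
    "hat_size": [7.0, 9.5, 8.0, 10.0],
    "shoe_size": [9.0, 4.0, 6.25, 12.0],
    "salary": [100, 200, 300, 400],
}


def avg_sqrt_shoe_size(cols):
    """AVG(SQRT(shoe_size)) WHERE hat_size > shoe_size, reading 2 columns."""
    picked = [math.sqrt(shoe)
              for hat, shoe in zip(cols["hat_size"], cols["shoe_size"])
              if hat > shoe]
    return sum(picked) / len(picked) if picked else None


def moving_average(values, window):
    """Order-sensitive computation over a single stored column."""
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]


def io_fraction(columns_read, columns_total):
    """Share of the data touched when whole columns are read."""
    return columns_read / columns_total


if __name__ == "__main__":
    print(avg_sqrt_shoe_size(employee))
    print(moving_average(employee["salary"], 2))
    print(io_fraction(2, 100))
```

The name and salary columns are never read by the first query. The cost model in io_fraction assumes equally sized columns, which is a simplification.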


Many other issues

• Heterogeneous data integration
  – a perennial problem
  – can it be done by the end-users?
• Distributed query evaluation against redundant, constrained data.
• Data provenance
• Data streams
• and many more

All these involve hard, fundamental problems in Computer Science