![Page 1: Curated Databases Peter Buneman School of Informatics University of Edinburgh](https://reader035.vdocuments.us/reader035/viewer/2022062511/551b3074550346cf5a8b609d/html5/thumbnails/1.jpg)
Curated Databases
Peter Buneman
School of InformaticsUniversity of Edinburgh
![Page 2: Curated Databases Peter Buneman School of Informatics University of Edinburgh](https://reader035.vdocuments.us/reader035/viewer/2022062511/551b3074550346cf5a8b609d/html5/thumbnails/2.jpg)
2
The Population of Corfu113.479 (2001) http://www.corfunext.com/corfu_geography.htm
107,879 (as of 2001 ) http://en.wikipedia.org/wiki/Corfu ***
93,000 http://www.corfunet.com/corfu/
109,512 www.agni.gr/
110,000 www.corfuvisit.net
70,000 http://www.newadvent.org/cathen/04362a.htm
107,600 http://www.greek-hotels.com/
105,043 http://www.merriam-webster.com/dictionary/corfu
approximately 110,000 www.kassiopi.com/MenuContent.aspx?MenuId=6
approximately 120.000 http://www.gardeno-corfu.com/
115,200 (2003 est) http://encyclopedia.farlex.com/Corfu
around 110,000 http://www.sunshinetravel.gr/CORFUGUIDE/CORFU_TRAVEL_GUIDE 0-1.htm
110.000 http://www.dialashop.com/travel/corfu.html
about 110,00 http://www.argobenitses.gr/greece.php
97,102 in 1981 http://geography.howstuffworks.com/europe/corfu.htm
107,880 http://catalogue.horse21.net/greece+hotels/corfu+hotels/hotels5/luxury
109,512 http://www.corfu-property.gr/content/view/14/38/lang,en/
about 100,000 http://members.virtualtourist.com/m/6ce90/67541/
110,000 approximately http://www.corfu-island.org/features.htm
107,000 http://www.nytimes.com/2009/09/11/greathomesanddestinations/11iht-recorfu.html
*** The only site to give attribution: http://www.statistics.gr/portal/page/portal/ESYEECDL
![Page 3: Curated Databases Peter Buneman School of Informatics University of Edinburgh](https://reader035.vdocuments.us/reader035/viewer/2022062511/551b3074550346cf5a8b609d/html5/thumbnails/3.jpg)
3
These are both curated databasesECDL
![Page 4: Curated Databases Peter Buneman School of Informatics University of Edinburgh](https://reader035.vdocuments.us/reader035/viewer/2022062511/551b3074550346cf5a8b609d/html5/thumbnails/4.jpg)
4
What is a curated database?
• A curated database is one that is maintained with a lot of human effort
• Curare: Latin “to care for”• Prime concern is quality of dataECDL
![Page 5: Curated Databases Peter Buneman School of Informatics University of Edinburgh](https://reader035.vdocuments.us/reader035/viewer/2022062511/551b3074550346cf5a8b609d/html5/thumbnails/5.jpg)
5
What is a database?(for the purposes of this talk)
• Any structured collection of data that is subject to change/revision – Ontologies– XML and other structured text files– Structured wikis – Standard relational and object-oriented databases
ECDL
![Page 6: Curated Databases Peter Buneman School of Informatics University of Edinburgh](https://reader035.vdocuments.us/reader035/viewer/2022062511/551b3074550346cf5a8b609d/html5/thumbnails/6.jpg)
6
Curated databases have interesting properties…
• A digital reference work. Traditional dictionaries, gazetteers, encyclopedia have been replaced by curated databases.
• Value lies in the organization and annotation of data• Commonly constructed by copying parts of other (curated)
databases.• Rapidly increasing in scientific research. (> 1000 in molecular biology)• Constantly checked/verified. Data quality and timeliness are
important.
• Often group efforts. Produced by a dedicated organization or collaboration.
• Increasingly seen as “publications” by scientists. (You get kudos if someone uses your database – like a citation.)
ECDL
![Page 7: Curated Databases Peter Buneman School of Informatics University of Edinburgh](https://reader035.vdocuments.us/reader035/viewer/2022062511/551b3074550346cf5a8b609d/html5/thumbnails/7.jpg)
... and they are very expensive
10-7Big physics (LHC) data
10-3[Movie]
10-1Book
1“Production” code/Curated data
10“Reliable” code / Curated data
In $/€/£ per byte
![Page 8: Curated Databases Peter Buneman School of Informatics University of Edinburgh](https://reader035.vdocuments.us/reader035/viewer/2022062511/551b3074550346cf5a8b609d/html5/thumbnails/8.jpg)
8
A change for the better?
Storage:• Redundant• Persistent• Distributed• Readable by peopleClear standards for citationHistorical record (old data is useful)Well understood ownership/IP
Storage:• Single-source• Volatile• Centralised• Internal DBMS formatNo standards for citationNo historical recordMind-boggling legal issues
20th century libraries did some things better!
ECDL
![Page 9: Curated Databases Peter Buneman School of Informatics University of Edinburgh](https://reader035.vdocuments.us/reader035/viewer/2022062511/551b3074550346cf5a8b609d/html5/thumbnails/9.jpg)
9
Some computer science issues
• Archiving (CS usage)• Provenance• Annotation/citation• Data cleaning
All of these are intimately connected.
For example, if you cite some part of a curated database, the version you cited should be available (archiving)
ECDL
![Page 10: Curated Databases Peter Buneman School of Informatics University of Edinburgh](https://reader035.vdocuments.us/reader035/viewer/2022062511/551b3074550346cf5a8b609d/html5/thumbnails/10.jpg)
10
CIA World Factbook Uniprot
Some well-known curated databasesID 11SB_CUCMA STANDARD; PRT; 480 AA.AC P13744;DT 01-JAN-1990 (REL. 13, CREATED)DT 01-JAN-1990 (REL. 13, LAST SEQUENCE UPDATE)DT 01-NOV-1990 (REL. 16, LAST ANNOTATION UPDATE)DE 11S GLOBULIN BETA SUBUNIT PRECURSOR.OS CUCURBITA MAXIMA (PUMPKIN) (WINTER SQUASH).OC EUKARYOTA; PLANTA; EMBRYOPHYTA; ANGIOSPERMAE; DICOTYLEDONEAE;OC VIOLALES; CUCURBITACEAE.RN [1]RP SEQUENCE FROM N.A.RC STRAIN=CV. KUROKAWA AMAKURI NANKIN;RX MEDLINE; 88166744.RA HAYASHI M., MORI H., NISHIMURA M., AKAZAWA T., HARA-NISHIMURA I.;RL EUR. J. BIOCHEM. 172:627-632(1988).RN [2]RP SEQUENCE OF 22-30 ND 297-302.RA OHMIYA M., HARA I., MASTUBARA H.;RL PLANT CELL PHYSIOL. 21:157-167(1980).CC -!- FUNCTION: THIS IS A SEED STORAGE PROTEIN.CC -!- SUBUNIT: HEXAMER; EACH SUBUNIT IS COMPOSED OF AN ACIDIC AND ACC BASIC CHAIN DERIVED FROM A SINGLE PRECURSOR AND LINKED BY ACC DISULFIDE BOND.CC -!- SIMILARITY: TO OTHER 11S SEED STORAGE PROTEINS (GLOBULINS).DR EMBL; M36407; G167492; -.DR PIR; S00366; FWPU1B.DR PROSITE; PS00305; 11S_SEED_STORAGE; 1.KW SEED STORAGE PROTEIN; SIGNAL.FT SIGNAL 1 21FT CHAIN 22 480 11S GLOBULIN BETA SUBUNIT.FT CHAIN 22 296 GAMMA CHAIN (ACIDIC).FT CHAIN 297 480 DELTA CHAIN (BASIC).FT MOD_RES 22 22 PYRROLIDONE CARBOXYLIC ACID.FT DISULFID 124 303 INTERCHAIN (GAMMA-DELTA) (POTENTIAL).FT CONFLICT 27 27 S -> E (IN REF. 2).FT CONFLICT 30 30 E -> S (IN REF. 2).SQ SEQUENCE 480 AA; 54625 MW; D515DD6E CRC32; MARSSLFTFL CLAVFINGCL SQIEQQSPWE FQGSEVWQQH RYQSPRACRL ENLRAQDPVR RAEAEAIFTE VWDQDNDEFQ CAGVNMIRHT IRPKGLLLPG FSNAPKLIFV AQGFGIRGIA IPGCAETYQT DLRRSQSAGS AFKDQHQKIR PFREGDLLVV PAGVSHWMYN RGQSDLVLIV FADTRNVANQ IDPYLRKFYL AGRPEQVERG VEEWERSSRK GSSGEKSGNI FSGFADEFLE EAFQIDGGLV RKLKGEDDER DRIVQVDEDF EVLLPEKDEE ERSRGRYIES ESESENGLEE TICTLRLKQN IGRSVRADVF NPRGGRISTA NYHTLPILRQ VRLSAERGVL YSNAMVAPHY TVNSHSVMYA TRGNARVQVV DNFGQSVFDG EVREGQVLMI PQNFVVIKRA SDRGFEWIAF KTNDNAITNL LAGRVSQMRM LPLGVLSNMY RISREEAQRL KYGQQEMRVL SPGRSQGRRE//
ECDL
![Page 11: Curated Databases Peter Buneman School of Informatics University of Edinburgh](https://reader035.vdocuments.us/reader035/viewer/2022062511/551b3074550346cf5a8b609d/html5/thumbnails/11.jpg)
11
Archiving / Database Preservation
• How do we preserve something that evolves (both in content and structure)
• Keep snapshots?– frequent: space consuming– infrequent: lose “history”
Most curated databases have a hierarchical structure that we can exploit…
ECDL
![Page 12: Curated Databases Peter Buneman School of Informatics University of Edinburgh](https://reader035.vdocuments.us/reader035/viewer/2022062511/551b3074550346cf5a8b609d/html5/thumbnails/12.jpg)
12
A Sequence of Versions
ECDL
![Page 13: Curated Databases Peter Buneman School of Informatics University of Edinburgh](https://reader035.vdocuments.us/reader035/viewer/2022062511/551b3074550346cf5a8b609d/html5/thumbnails/13.jpg)
13
This relies on a deterministic / keyed model – there’s a unique path to every data item.
Pushing time down
ECDL
[B., Khanna, Tajima, Tan, TODS 27,2 (2004)]
![Page 14: Curated Databases Peter Buneman School of Informatics University of Edinburgh](https://reader035.vdocuments.us/reader035/viewer/2022062511/551b3074550346cf5a8b609d/html5/thumbnails/14.jpg)
14
An initial experiment
• Grabbed the last 20 available versions of Swissprot• XML-ized all of them• Also recorded all OMIM versions for about 14 weeks
(100 of them)• Combined into archive XML format file by pushing
time down.
ECDL
![Page 15: Curated Databases Peter Buneman School of Informatics University of Edinburgh](https://reader035.vdocuments.us/reader035/viewer/2022062511/551b3074550346cf5a8b609d/html5/thumbnails/15.jpg)
15
100 days of OMIM
Siz
e (
byte
s) x
10
6
XMill(archive)
gzip(inc diff)
versionarchive, inc diff
Legend•archive•inc diff •version•compressed inc diff•compressed archive
Uncompressed
• Archive size is – 1.01 times diff repository
size– 1.04 times size of largest
versionCompressed • archive size is between 0.94 and
1 times compressed diff repository size
• gzip - unix compression tool• XMill - XML compression tool
ECDL
![Page 16: Curated Databases Peter Buneman School of Informatics University of Edinburgh](https://reader035.vdocuments.us/reader035/viewer/2022062511/551b3074550346cf5a8b609d/html5/thumbnails/16.jpg)
16
~ 5 years of UniProt
Siz
e (
byte
s) x
10
6
arc
hiv
e
XMill(archive)
vers
ion
inc
diff
gzip(in
c diff)
Legend•archive•inc diff •version•compressed inc diff•compressed archive
Uncompressed• Archive size is
– 1.08 times diff repository size
– 1.92 times size of largest version
• Compressed • archive size is between 0.59 and
1 times compressed diff repository size
ECDL
![Page 17: Curated Databases Peter Buneman School of Informatics University of Edinburgh](https://reader035.vdocuments.us/reader035/viewer/2022062511/551b3074550346cf5a8b609d/html5/thumbnails/17.jpg)
17
Snapshots are immediate. Longitudinal/temporal queries are also easy
Factbook
Demography
Andorra
LiechtensteinChina
Economy
Population
[1990-2006]
*
* * *
**
[1990] [1991] [2006]…
Plot, by year, the population of Liechtenstein since 1990
34,24728,292 28,476
ECDL
![Page 18: Curated Databases Peter Buneman School of Informatics University of Edinburgh](https://reader035.vdocuments.us/reader035/viewer/2022062511/551b3074550346cf5a8b609d/html5/thumbnails/18.jpg)
18
A Working System
• Implemented by Heiko Müller
• For scale, we require external sorting of large XML files
• Designed and implemented by Ioannis Koltsidas Heiko Müller and Stratis Viglas
• Has a simple temporal query language
• Experimented with recent (HTML) versions of CIA world factbook
ECDL
![Page 19: Curated Databases Peter Buneman School of Informatics University of Edinburgh](https://reader035.vdocuments.us/reader035/viewer/2022062511/551b3074550346cf5a8b609d/html5/thumbnails/19.jpg)
19
What the archive looks like<T t="2002-2007"> <FACTBOOK> <COUNTRY> <NAME>Afghanistan</NAME> <CATEGORY> <NAME>Communications</NAME> <PROPERTY> <NAME>Internet users</NAME> <TEXT> <T t="2004-2005">1,000 (2002)</T> <T t="2006-2007">30,000 (2005)</T> <T t="2002-2003">NA</T> </TEXT> </PROPERTY> <PROPERTY> <NAME>Radios</NAME> <TEXT>167,000 (1999)</TEXT> </PROPERTY> <PROPERTY> <NAME>Telephones - main lines in use</NAME> <TEXT> <T t="2006">100,000 (2005)</T> <T t="2007">280,000 (2005)</T> <T t="2002-2003">29,000 (1998)</T> <T t="2004-2005">33,100 (2002)</T>…
ECDL
![Page 20: Curated Databases Peter Buneman School of Informatics University of Edinburgh](https://reader035.vdocuments.us/reader035/viewer/2022062511/551b3074550346cf5a8b609d/html5/thumbnails/20.jpg)
20
How did the population of Chinachange from 2002-2007?
<T t="2002-2007"> <FACTBOOK> <COUNTRY> <CATEGORY> <PROPERTY> <NAME>Population</NAME> <TEXT> <T t="2002">1,284,303,705 (July 2002 est.)</T> <T t="2003">1,286,975,468 (July 2003 est.)</T> <T t="2004">1,298,847,624 (July 2004 est.)</T> <T t="2005">1,306,313,812 (July 2005 est.)</T> <T t="2006">1,313,973,713 (July 2006 est.)</T> <T t="2007">1,321,851,888 (July 2007 est.)</T> </TEXT> </PROPERTY> </CATEGORY> </COUNTRY> </FACTBOOK></T>
ECDL
![Page 21: Curated Databases Peter Buneman School of Informatics University of Edinburgh](https://reader035.vdocuments.us/reader035/viewer/2022062511/551b3074550346cf5a8b609d/html5/thumbnails/21.jpg)
21
How did land area of countries change in 2002-2007?<T t="2002-2007"> <FACTBOOK KEY=""> … <COUNTRY KEY="NAME Austria"> <CATEGORY KEY="NAME Geography"> <PROPERTY KEY="NAME Area"> <SUBPROP> <NAME>land</NAME> <TEXT> <T t="2004-2007">82,444 sq km</T> <T t="2002-2003">82,738 sq km</T> </TEXT> </SUBPROP> </PROPERTY> </CATEGORY> </COUNTRY> … <COUNTRY KEY="NAME France"> <CATEGORY KEY="NAME Geography"> <PROPERTY KEY="NAME Area"> <SUBPROP> <NAME>land</NAME> <TEXT> <T t="2002-2006">545,630 sq km</T> <T t="2007">640,053 sq km; 545,630 sq km (metropolitan France)</T> </TEXT>… ECDL
![Page 22: Curated Databases Peter Buneman School of Informatics University of Edinburgh](https://reader035.vdocuments.us/reader035/viewer/2022062511/551b3074550346cf5a8b609d/html5/thumbnails/22.jpg)
22
What are the differences between the factbookson 21/08/2007 and 10/09/2007?
<T t="21/08/2007-10/09/2007"> <CIAWFB KEY=""> <COUNTRY KEY="NAME Afghanistan"> <CATEGORY KEY="NAME Communications"> <PROPERTY KEY="NAME Internet users"> <T t="21/08/2007"> <TEXT>30,000 (2005)</TEXT> </T> <T t="10/09/2007"> <TEXT>535,000 (2006)</TEXT> </T> </PROPERTY> <PROPERTY KEY="NAME Telephones - mobile cellular"> <T t="21/08/2007"> <TEXT>1.4 million (2005)</TEXT> </T> <T t="10/09/2007"> <TEXT>2.52 million (2006)</TEXT> </T> …
ECDL
![Page 23: Curated Databases Peter Buneman School of Informatics University of Edinburgh](https://reader035.vdocuments.us/reader035/viewer/2022062511/551b3074550346cf5a8b609d/html5/thumbnails/23.jpg)
23
http://homepages.inf.ed.ac.uk/hmueller/xarch/download.html
Heiko Müller’s Xarch• Examples of use with
⁻ Ontologies⁻ XML files⁻ Relational databases
• Automatically converts RDBs into XML
• Efficiently extracts snapshots
• Simple temporal query language
ECDL
![Page 24: Curated Databases Peter Buneman School of Informatics University of Edinburgh](https://reader035.vdocuments.us/reader035/viewer/2022062511/551b3074550346cf5a8b609d/html5/thumbnails/24.jpg)
24
Provenance – a huge issue
• Where did this data come from?• How did it get here?• How was it constructed?• . . .
Two schools of research:• Workflow (coarse-grain) provenance – a complete
record of how some large scientific analysis/simulation was performed.
• Data (fine-grain) a record of how some small piece of data (in a larger databases) was produced
ECDL
![Page 25: Curated Databases Peter Buneman School of Informatics University of Edinburgh](https://reader035.vdocuments.us/reader035/viewer/2022062511/551b3074550346cf5a8b609d/html5/thumbnails/25.jpg)
25
Copy-paste, or<cntrl>C <cntrl>V
113.479 (2001) http://www.corfunext.com/corfu_geography.htm
107,879 (as of 2001 ) http://en.wikipedia.org/wiki/Corfu ***
109,512 www.agni.gr/
105,043 http://www.merriam-webster.com/dictionary/corfu
115,200 (2003 est) http://encyclopedia.farlex.com/Corfu
97,102 in 1981 http://geography.howstuffworks.com/europe/corfu.htm
107,880 http://catalogue.horse21.net/greece+hotels/corfu+hotels/hotels5/luxury
Data provenance: an example
ECDL
![Page 26: Curated Databases Peter Buneman School of Informatics University of Edinburgh](https://reader035.vdocuments.us/reader035/viewer/2022062511/551b3074550346cf5a8b609d/html5/thumbnails/26.jpg)
26
“Where provenance”
Possible explanations of how something was copied:
This data item was extracted from location L1 in document D1 and placed in location L2 in document D2
or
This data item was extracted from database D1 by query Q1 and placed in database D2 by update U2
(or some combination of the two)
ECDL
![Page 27: Curated Databases Peter Buneman School of Informatics University of Edinburgh](https://reader035.vdocuments.us/reader035/viewer/2022062511/551b3074550346cf5a8b609d/html5/thumbnails/27.jpg)
27
ID 11SB_CUCMA STANDARD; PRT; 480 AA.AC P13744;DT 01-JAN-1990 (REL. 13, CREATED)DT 01-JAN-1990 (REL. 13, LAST SEQUENCE UPDATE)DT 01-NOV-1990 (REL. 16, LAST ANNOTATION UPDATE)DE 11S GLOBULIN BETA SUBUNIT PRECURSOR.OS CUCURBITA MAXIMA (PUMPKIN) (WINTER SQUASH).OC EUKARYOTA; PLANTA; EMBRYOPHYTA; ANGIOSPERMAE; DICOTYLEDONEAE;OC VIOLALES; CUCURBITACEAE.RN [1]RP SEQUENCE FROM N.A.RC STRAIN=CV. KUROKAWA AMAKURI NANKIN;RX MEDLINE; 88166744.RA HAYASHI M., MORI H., NISHIMURA M., AKAZAWA T., HARA-NISHIMURA I.;RL EUR. J. BIOCHEM. 172:627-632(1988).RN [2]RP SEQUENCE OF 22-30 AND 297-302.RA OHMIYA M., HARA I., MASTUBARA H.;RL PLANT CELL PHYSIOL. 21:157-167(1980).CC -!- FUNCTION: THIS IS A SEED STORAGE PROTEIN.CC -!- SUBUNIT: HEXAMER; EACH SUBUNIT IS COMPOSED OF AN ACIDIC AND ACC BASIC CHAIN DERIVED FROM A SINGLE PRECURSOR AND LINKED BY ACC DISULFIDE BOND.CC -!- SIMILARITY: TO OTHER 11S SEED STORAGE PROTEINS (GLOBULINS).DR EMBL; M36407; G167492; -.DR PIR; S00366; FWPU1B.DR PROSITE; PS00305; 11S_SEED_STORAGE; 1.KW SEED STORAGE PROTEIN; SIGNAL.FT SIGNAL 1 21FT CHAIN 22 480 11S GLOBULIN BETA SUBUNIT.FT CHAIN 22 296 GAMMA CHAIN (ACIDIC).FT CHAIN 297 480 DELTA CHAIN (BASIC).FT MOD_RES 22 22 PYRROLIDONE CARBOXYLIC ACID.FT DISULFID 124 303 INTERCHAIN (GAMMA-DELTA) (POTENTIAL).FT CONFLICT 27 27 S -> E (IN REF. 2).FT CONFLICT 30 30 E -> S (IN REF. 2).SQ SEQUENCE 480 AA; 54625 MW; D515DD6E CRC32; MARSSLFTFL CLAVFINGCL SQIEQQSPWE FQGSEVWQQH RYQSPRACRL ENLRAQDPVR RAEAEAIFTE VWDQDNDEFQ CAGVNMIRHT IRPKGLLLPG FSNAPKLIFV AQGFGIRGIA IPGCAETYQT DLRRSQSAGS AFKDQHQKIR PFREGDLLVV PAGVSHWMYN RGQSDLVLIV FADTRNVANQ IDPYLRKFYL AGRPEQVERG VEEWERSSRK GSSGEKSGNI FSGFADEFLE EAFQIDGGLV RKLKGEDDER DRIVQVDEDF EVLLPEKDEE ERSRGRYIES ESESENGLEE TICTLRLKQN IGRSVRADVF NPRGGRISTA NYHTLPILRQ VRLSAERGVL YSNAMVAPHY TVNSHSVMYA TRGNARVQVV DNFGQSVFDG EVREGQVLMI PQNFVVIKRA SDRGFEWIAF KTNDNAITNL LAGRVSQMRM LPLGVLSNMY RISREEAQRL KYGQQEMRVL SPGRSQGRRE//
DE 11S GLOBULIN BETA SUBUNIT PRECURSOR.OS CUCURBITA MAXIMA (PUMPKIN) (WINTER SQUASH).OC EUKARYOTA; PLANTA; EMBRYOPHYTA; ANGIOSPERMAE; DICOTYLEDONEAE;OC VIOLALES; CUCURBITACEAE.
CC -!- FUNCTION: THIS IS A SEED STORAGE PROTEIN.CC -!- SUBUNIT: HEXAMER; EACH SUBUNIT IS COMPOSED OF AN ACIDIC AND ACC BASIC CHAIN DERIVED FROM A SINGLE PRECURSOR AND LINKED BY ACC DISULFIDE BOND.CC -!- SIMILARITY: TO OTHER 11S SEED STORAGE PROTEINS (GLOBULINS).
FT CHAIN 22 480 11S GLOBULIN BETA SUBUNIT.FT CHAIN 22 296 GAMMA CHAIN (ACIDIC).FT CHAIN 297 480 DELTA CHAIN (BASIC).FT MOD_RES 22 22 PYRROLIDONE CARBOXYLIC ACID.FT DISULFID 124 303 INTERCHAIN (GAMMA-DELTA) (POTENTIAL).
Where does this information come from? Which curator? Or was it the cited papers?Was it copied from some other DB?
Where Provenance
ECDL
![Page 28: Curated Databases Peter Buneman School of Informatics University of Edinburgh](https://reader035.vdocuments.us/reader035/viewer/2022062511/551b3074550346cf5a8b609d/html5/thumbnails/28.jpg)
28
Copy-paste model of curated DBs
(a) A biologist copies some UniProt records into her DB.(b) She fixes entries so that UniProt PTMs are not confused with hers.(c) She copies in some publication details from OMIM(d) She corrects a mistake in a PubMed publication number.[B. Chapman, Cheney, Sigmod ’06]
Curated databases are not views!!
ECDL
![Page 29: Curated Databases Peter Buneman School of Informatics University of Edinburgh](https://reader035.vdocuments.us/reader035/viewer/2022062511/551b3074550346cf5a8b609d/html5/thumbnails/29.jpg)
29
A very simple copy-paste language(uses a “deterministic” tree model)
How costly is it to record all this?
(1) delete c5 from T;(2) copy S1/a1/y into T/c1/y;(3) insert {c2 : {}} into T;(4) copy S1/a2 into T/c2;(5) insert {y : {}} into T/c2;(6) copy S2/b3/y into T/c2/y;(7) copy S1/a3 into T/c3;(8) insert {c4 : {}} into T;(9) copy S2/b2 into T/c4;(10) insert {y : 12} into T/c4;
ECDL
![Page 30: Curated Databases Peter Buneman School of Informatics University of Edinburgh](https://reader035.vdocuments.us/reader035/viewer/2022062511/551b3074550346cf5a8b609d/html5/thumbnails/30.jpg)
30
How to reduce space
• Complete provenance: Record every update.• Transactional provenance: Record the links at the end of some
user-defined transaction (sequence of updates)• Hierarchical (inferred) provenance. Only record a link if it
cannot be inferred from the provenance of a higher node
Taken together these provide a substantial saving on storage. Overhead comparable with the size of the DB in some realistic simulations
ECDL
![Page 31: Curated Databases Peter Buneman School of Informatics University of Edinburgh](https://reader035.vdocuments.us/reader035/viewer/2022062511/551b3074550346cf5a8b609d/html5/thumbnails/31.jpg)
31
(select A, 5 as B from R where A = 1)union (select * from R where A <> 1)
delete from R where A = 1; insert into R values (1,5)
update R set B = 5 where A = 1
1 3
6 7
1 5
6 7
1 3
6 7
1 5
6 7
1 3
6 7
1 5
6 7
Query languages and where provenanceA B A B
[B., Cheney, Vansummeren, TODS 33,4, 2008]ECDL
![Page 32: Curated Databases Peter Buneman School of Informatics University of Edinburgh](https://reader035.vdocuments.us/reader035/viewer/2022062511/551b3074550346cf5a8b609d/html5/thumbnails/32.jpg)
32
Other forms of provenance in query languages
• Why-provenance: why is a tuple in the output, or what parts of the input “contributed to” the tuple? [Widom et al]
• How-provenance: how (by what process) was this tuple constructed. [Tannen et al]
DatabaseLarge, heterogeneoussource
Small part ofsource
Complex program or process
Simpler program/process
Taken together, these are the “explanation”.
“Piece” of data: data value, tuple.etc
ECDL
![Page 33: Curated Databases Peter Buneman School of Informatics University of Edinburgh](https://reader035.vdocuments.us/reader035/viewer/2022062511/551b3074550346cf5a8b609d/html5/thumbnails/33.jpg)
33
Workflow provenance
Taken from [Cohen, et al DILS 2006]• Each step S1. . . S4 is itself a workflow.• How does one record an “enactment” of the workflow?• How much “context” does one record?
– from people– from databases that change
• Recent attempts to produce a general model– Open Provenance Model [Moreau et al. 2007]– Petri Net + Complex Object [Hidders et al.Inf Syst 2008]
ECDL
![Page 34: Curated Databases Peter Buneman School of Informatics University of Edinburgh](https://reader035.vdocuments.us/reader035/viewer/2022062511/551b3074550346cf5a8b609d/html5/thumbnails/34.jpg)
34
Provenance is very general issue
• Intrinsic to data quality.• It is starting to be used in several areas of CS:
– Semantics of update languages.– Probabilistic databases– Data integration– Debugging schema transformations– File/data synchronization– Program debugging (program slicing)– Security
• The fundamental problem is finding the right model/models – can we combine data and workflow models?
ECDL
![Page 35: Curated Databases Peter Buneman School of Informatics University of Edinburgh](https://reader035.vdocuments.us/reader035/viewer/2022062511/551b3074550346cf5a8b609d/html5/thumbnails/35.jpg)
35
Annotation – closely related to provenance
• Much of the activity of curators is the annotation of existing data.
• When we copy that data, we should also copy its annotations
• The propagation of annotation follows (where-) provenance
• But the story is more complicated because we often annotate views
ECDL
![Page 36: Curated Databases Peter Buneman School of Informatics University of Edinburgh](https://reader035.vdocuments.us/reader035/viewer/2022062511/551b3074550346cf5a8b609d/html5/thumbnails/36.jpg)
36
The Distributed Annotation Server (DAS)
ECDL
![Page 37: Curated Databases Peter Buneman School of Informatics University of Edinburgh](https://reader035.vdocuments.us/reader035/viewer/2022062511/551b3074550346cf5a8b609d/html5/thumbnails/37.jpg)
37ECDL
![Page 38: Curated Databases Peter Buneman School of Informatics University of Edinburgh](https://reader035.vdocuments.us/reader035/viewer/2022062511/551b3074550346cf5a8b609d/html5/thumbnails/38.jpg)
Annotating Databases
Guinness Stout 5.0 Eire
Heineken Pilsner 5.0 Netherlands
Old Jock Ale 6.7 Scotland
Guinness Stout 7.5 Nigeria
Fischer Blonde 6.0 France
Guinness Stout
Heineken Pilsner
Old Jock Ale
Fischer Blonde
Guinness Stout 5.0 Eire
Heineken Pilsner 5.0 Netherlands
Fischer Blonde 6.0 France
Stijn says this is not a beer
Stijn says this is not a beer
Stijn says this is not a beer
π σ
Polygen [Wang & Madnick VLDB 1990], DBNotes [Bhagwat et al, VLDB 2004] Concern is propagation of annotations from views to source and back. Again, there is an interesting theory
Not strong
Not strong
Not strong
![Page 39: Curated Databases Peter Buneman School of Informatics University of Edinburgh](https://reader035.vdocuments.us/reader035/viewer/2022062511/551b3074550346cf5a8b609d/html5/thumbnails/39.jpg)
39
How do you cite something in a database?
Many scientific databases ask you to cite them, but..• they don’t tell you how, or• they tell you to give the URL, or• they tell you to cite a paper about the database.
Nutrition Education for Diverse Audiences [Internet]. Urbana (IL): University of Illinois Cooperative Extension Service, Illinet Department; [updated 2000 Nov 28; cited 2001 Apr 25]. Diabetes mellitus lesson; [about 1 screen]. Available
from http://www.aces.uiuc.edu/~necd/inter2_search.cgi?ind=854148396
NLM Recommended Formats for Bibliographic Citation.Internet Supplement. NLM Technical report Bethesda, MD 20894, July 2001.
ECDL
![Page 40: Curated Databases Peter Buneman School of Informatics University of Edinburgh](https://reader035.vdocuments.us/reader035/viewer/2022062511/551b3074550346cf5a8b609d/html5/thumbnails/40.jpg)
40
What is a citation?
Bard JB and Davies JA. Development, Databases and the Internet. Bioessays. 1995 Nov; 17(11):999-1001.[Location and descriptive information]
Ann. Phys., Lpz 18 639-641 Nature, 171,737-738(We often want more than location)
ECDL
![Page 41: Curated Databases Peter Buneman School of Informatics University of Edinburgh](https://reader035.vdocuments.us/reader035/viewer/2022062511/551b3074550346cf5a8b609d/html5/thumbnails/41.jpg)
41
Automatically generating citations
{ DB=IUPHAR, Version=$v, Family=$f Receptor=$r, Contributors=$a, Editor=$e, Date=$d, DOI=$i} ←
/Root[ ]/Version[Number=$’v, Editor=$?e, DOI=$.i, Date=$.d] /Data[ ] /Family[FamilyName=$’f] /Contributor-list/Contributor=$+a] /Receptor[ReceptorName=$’r]
{ DB=IUPHAR, Version=11, Family=Calcitonin, Receptor=CALCR, Contributors={Debbie Hay, David R. Poyner}, Editor=Tony Harmar, Date=Jan 2006, DOI=10.1234 }
A rule:
What gets generated (example):
ECDL
![Page 42: Curated Databases Peter Buneman School of Informatics University of Edinburgh](https://reader035.vdocuments.us/reader035/viewer/2022062511/551b3074550346cf5a8b609d/html5/thumbnails/42.jpg)
42
Other topics: Data quality and data cleaning
• Published data often looks clean but is intrinsically messy– “Dead” fields in the underlying data– Multiple syntactic conventions– Abuse of / confusion over formats & schema
• Human errors require human correction– Automate error detection rather than error correction
• Cleaning is an essential prerequisite in any integration or preservation task.
ECDL
![Page 43: Curated Databases Peter Buneman School of Informatics University of Edinburgh](https://reader035.vdocuments.us/reader035/viewer/2022062511/551b3074550346cf5a8b609d/html5/thumbnails/43.jpg)
43
Other topics: Evolution of Structure
• Curated DBs evolve from humble origins. Schemas are often wrong; they are– designed by people who don’t understand schemas– designed before the domain is fully understood
• Do ontologies help (you can build an ontology without worrying much about the schema) or do they defer the problem and make it worse?
ECDL
![Page 44: Curated Databases Peter Buneman School of Informatics University of Edinburgh](https://reader035.vdocuments.us/reader035/viewer/2022062511/551b3074550346cf5a8b609d/html5/thumbnails/44.jpg)
44
The larger (economic and social) issues
• Who will archive/curate curated databases?• Should they be open-access?
– who pays for their maintenance?
• What are the legal/IP issues?
ECDL
![Page 45: Curated Databases Peter Buneman School of Informatics University of Edinburgh](https://reader035.vdocuments.us/reader035/viewer/2022062511/551b3074550346cf5a8b609d/html5/thumbnails/45.jpg)
A case study: IUPHAR database(curated by Tony Harmar and team)
• “Standard” curated database • Labour-intensive (hundreds of
contributors)
• Valuable (supported by drug companies)
• Simple, clean structure – as seen by users
50m
IUPHARDCC
![Page 46: Curated Databases Peter Buneman School of Informatics University of Edinburgh](https://reader035.vdocuments.us/reader035/viewer/2022062511/551b3074550346cf5a8b609d/html5/thumbnails/46.jpg)
46ECDL
![Page 47: Curated Databases Peter Buneman School of Informatics University of Edinburgh](https://reader035.vdocuments.us/reader035/viewer/2022062511/551b3074550346cf5a8b609d/html5/thumbnails/47.jpg)
47ECDL
![Page 48: Curated Databases Peter Buneman School of Informatics University of Edinburgh](https://reader035.vdocuments.us/reader035/viewer/2022062511/551b3074550346cf5a8b609d/html5/thumbnails/48.jpg)
48ECDL
![Page 49: Curated Databases Peter Buneman School of Informatics University of Edinburgh](https://reader035.vdocuments.us/reader035/viewer/2022062511/551b3074550346cf5a8b609d/html5/thumbnails/49.jpg)
49ECDL
![Page 50: Curated Databases Peter Buneman School of Informatics University of Edinburgh](https://reader035.vdocuments.us/reader035/viewer/2022062511/551b3074550346cf5a8b609d/html5/thumbnails/50.jpg)
50
We wanted to use our archiver
• Our first task was to convert the database into a hierarchical structure (following the web presentation) so that we could archive it.
• We used the Prata XML (Fan et al) publishing software• This had some unexpected benefits…
ECDL
![Page 51: Curated Databases Peter Buneman School of Informatics University of Edinburgh](https://reader035.vdocuments.us/reader035/viewer/2022062511/551b3074550346cf5a8b609d/html5/thumbnails/51.jpg)
51
… <transduction>
<secondary>Y</secondary> <transcomments/> <transcites/> </transduction> </transductions> <ligandTypes> <ligandType> <typeName>Agonist</typeName> <ligandComment>empty</ligandComment> <ligands> <ligand> <radioactive>NO</radioactive> <endogenous>YES</endogenous> <alternative>NO</alternative> <ligandSpeices>Human</ligandSpeices> <ligandName>oxytocin</ligandName> <affinity>9.1-8.8</affinity> <ligandAction>Full Agonist</ligandAction> <ligandUnits>p<i>K</i><sub>d</sub></ligandUnits> <ligandCites>9</ligandCites> </ligand> <ligand> <radioactive>YES</radioactive> <endogenous>NO</endogenous> <alternative>NO</alternative> <ligandSpeices>Human</ligandSpeices> <ligandName>[<sup>3</sup>H]-oxytocin</ligandName> <affinity>9.1-8.8</affinity> <ligandAction>Full Agonist</ligandAction> <ligandUnits>p<i>K</i><sub>d</sub></ligandUnits> <ligandCites>9, 42</ligandCites> </ligand>…
ECDL
![Page 52: Curated Databases Peter Buneman School of Informatics University of Edinburgh](https://reader035.vdocuments.us/reader035/viewer/2022062511/551b3074550346cf5a8b609d/html5/thumbnails/52.jpg)
52
• We can preserve all versions of the data (as intended)• We can generate static web pages (less software, more
efficient)• We can make the database citable• Tony can trace the history of entries• Tony can generate an old-fashioned book (yes, he wants to do
this!)• We have a “community model” for data exchange• The data got cleaned up in the process• The representation information (required by archivists) is
greatly simplified
ECDL
![Page 53: Curated Databases Peter Buneman School of Informatics University of Edinburgh](https://reader035.vdocuments.us/reader035/viewer/2022062511/551b3074550346cf5a8b609d/html5/thumbnails/53.jpg)
Selected pages from the book – generated by a 100-line style sheet
53
![Page 54: Curated Databases Peter Buneman School of Informatics University of Edinburgh](https://reader035.vdocuments.us/reader035/viewer/2022062511/551b3074550346cf5a8b609d/html5/thumbnails/54.jpg)
Our library will “host” the book, but not the database!
54
![Page 55: Curated Databases Peter Buneman School of Informatics University of Edinburgh](https://reader035.vdocuments.us/reader035/viewer/2022062511/551b3074550346cf5a8b609d/html5/thumbnails/55.jpg)
55
Centralized vs. distributed publishing20th century libraries provided robust, distributed dissemination and preservation of reference material
Valuable information was lost in earlier “data centers” . Is this still happening?
Replication and distribution has always been the best guarantee of preservation. We should do the same for curated databases – a database LOCKSS ?
ECDL
![Page 56: Curated Databases Peter Buneman School of Informatics University of Edinburgh](https://reader035.vdocuments.us/reader035/viewer/2022062511/551b3074550346cf5a8b609d/html5/thumbnails/56.jpg)
56
Many of the issues are non “technical”
• A good economic model for sustainability– Open access works for journal papers– Can it work for curated DBs? They require long-term
support. And people who write reference manuals sometimes expect to make money out of them.
• Intellectual property in curated databases is a nightmare– legislation still largely based on the notion of copying.
• We can still help by providing good models of the processes in curating and publishing databases
ECDL