Archiving Scientific Data - University of Pennsylvania
Archiving Scientific Data
Susan B. Davidson
CIS 700: Advanced Topics in Databases
MW 1:30-3
Towne 309
http://www.cis.upenn.edu/~susan/cis700/homepage.html
Why archive?
• Data changes over time
  • New data is added
  • Mistakes are corrected
  • Old data is removed
• To enable reproducibility and verifiability, it must be possible to access the state of a database as of a certain point in time.
  • Also crucial for dereferencing citations
• May also want to ask questions about how the database has changed.
How to archive?
• Many databases periodically publish new versions
• Keep copy of each version
  • Allows data as of a certain time to be accessed quickly
  • May not be space efficient since very little may change between versions
  • Doesn’t allow efficient queries over the change history
• Keep a log of changes (“sequence of delta”)
  • Space efficient
  • May be expensive to recompute data as of a certain time
  • May be expensive to query change history
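The trade-off between the two strategies can be made concrete with a small sketch. It assumes a database version can be modelled as a Python set of records; the class names `SnapshotArchive` and `DeltaArchive` are hypothetical and do not come from the slides.

```python
class SnapshotArchive:
    """Keeps a full copy of every published version."""

    def __init__(self):
        self.versions = []

    def commit(self, db):
        self.versions.append(frozenset(db))

    def as_of(self, t):
        # Direct lookup: fast, but every version is stored in full.
        return set(self.versions[t])


class DeltaArchive:
    """Keeps only the (added, removed) records of each version."""

    def __init__(self):
        self.deltas = []
        self._latest = set()

    def commit(self, db):
        db = set(db)
        self.deltas.append((db - self._latest, self._latest - db))
        self._latest = db

    def as_of(self, t):
        # Versions 0..t must be replayed to rebuild the state at t.
        state = set()
        for added, removed in self.deltas[: t + 1]:
            state = (state - removed) | added
        return state


versions = [{"a", "b"}, {"a", "b", "c"}, {"a", "c", "d"}]
snap, delta = SnapshotArchive(), DeltaArchive()
for v in versions:
    snap.commit(v)
    delta.commit(v)

# Number of records each strategy has to store.
snapshot_cost = sum(len(v) for v in snap.versions)
delta_cost = sum(len(a) + len(r) for a, r in delta.deltas)
```

Both archives answer `as_of(t)` identically; the snapshot archive pays in storage, while the delta archive pays by replaying the log on every lookup.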
Outline
• Versioning and citation: experiences with eagle-i
• Archiving XML datasets
• Conclusions
Our experience: eagle-i
• eagle-i is an RDF dataset which contains information about resources for translational research (e.g. software, cell lines, lab facilities)
• Each resource has an immutable eagle-i id; the subject of each resource triple is an eagle-i id
• Resources are classified using an ontology, and the citation depends on the classification of the resource.
• eagle-i talked about citation but didn’t automate it…
Citation architecture
eagle-i versioning manager
• The latest copy of eagle-i is available on the website, but it is not “versioned”
• We did a daily download since we didn’t know how frequently it changed (not frequently!)
• Needed “time queries” to understand how the dataset changed over time
  • What triples were added/deleted in the period [t, t’]?
  • What was the object of triple X at time t?
  • When was triple Y first added/deleted
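The three time queries above can be sketched over a store that records, for each triple, the intervals during which it was present. This is an illustrative assumption about the representation, not the versioning manager's actual design; the class `TripleHistory`, its method names, and the sample triples are all hypothetical.

```python
class TripleHistory:
    """Records validity intervals for triples across successive downloads."""

    def __init__(self):
        # triple -> list of [start, end]; end stays None while the triple is present
        self.intervals = {}
        self.current = set()

    def ingest(self, t, snapshot):
        """Compare the download taken at time t with the previous one."""
        snapshot = set(snapshot)
        for triple in snapshot - self.current:
            self.intervals.setdefault(triple, []).append([t, None])
        for triple in self.current - snapshot:
            self.intervals[triple][-1][1] = t
        self.current = snapshot

    def changes(self, t1, t2):
        """Triples added and triples deleted in the period [t1, t2]."""
        added = {tr for tr, ivs in self.intervals.items()
                 if any(t1 <= s <= t2 for s, _ in ivs)}
        deleted = {tr for tr, ivs in self.intervals.items()
                   if any(e is not None and t1 <= e <= t2 for _, e in ivs)}
        return added, deleted

    def object_at(self, subject, predicate, t):
        """Objects attached to (subject, predicate) at time t."""
        return {o for (s, p, o), ivs in self.intervals.items()
                if s == subject and p == predicate
                and any(st <= t and (e is None or t < e) for st, e in ivs)}

    def first_added(self, triple):
        return self.intervals[triple][0][0]

    def first_deleted(self, triple):
        # None if the triple has never been deleted
        return self.intervals[triple][0][1]


h = TripleHistory()
h.ingest(1, {("r1", "label", "HeLa"), ("r1", "type", "CellLine")})
h.ingest(2, {("r1", "label", "HeLa cells"), ("r1", "type", "CellLine")})
h.ingest(3, {("r1", "label", "HeLa cells"), ("r1", "type", "CellLine")})

added, deleted = h.changes(2, 3)
label_at_1 = h.object_at("r1", "label", 1)
label_at_2 = h.object_at("r1", "label", 2)
```

Because only the differences between consecutive downloads are recorded, a dataset that rarely changes adds almost nothing to the store on most days.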
Example: versioning 2 RDF triples
Versioning and citation
• When should versioning be triggered?
  • At least when a user cites an eagle-i resource
• What should be versioned?
  • At least changes to the resource being cited.
➢ If a version of a resource is not cited, it does not have to be stored.
➢ However, time-based queries will only detect changes with respect to citations rather than all changes.
Outline
• Versioning and citation: experiences with eagle-i
• Archiving XML
• Conclusions
Recall: approaches to archiving
• Keep copy of each new version of the database
  • Allows data as of a certain time to be accessed quickly
  • May not be space efficient since very little may change between versions
  • Doesn’t allow efficient queries over the change history
• Keep a log of changes (“sequence of delta”)
  • Space efficient
  • May be expensive to recompute data as of a certain time
  • May be expensive to query change history
Problem with diff-based approaches
• Ignores the “semantic continuity of keys” by focusing on minimal edit distance
Proposed approach in paper
• Focus on hierarchical scientific datasets
  • XML-based
  • Changes are primarily insertions
• Changes identified based on keys
• Version merging based on keys
• Inheritance of timestamps
  • Timestamp is stored at a child element only when it is different from the timestamp of its parent element
➢ “Key-based + merging” approach
Example: sequence of versions
Adding keys
Example of an archive
Representing archive in XML
What is a key for XML?
• A key has form (Q, {P1, …, Pk}), where Q, Pi are path expressions
  • Q identifies the target set
  • Pi are key paths, analogous to key attributes in relations
• An XML document satisfies a key (Q, {P1, …, Pk}) if
  • From any node identified by Q, every Pi exists uniquely
  • If two nodes n1 and n2 identified by Q have the same value at the end of each key path in {P1, …, Pk} then n1 and n2 are the same node.
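The two conditions for satisfying a key can be checked mechanically. The sketch below uses Python's `xml.etree.ElementTree`, whose `findall` supports a limited XPath subset; it makes the simplifying assumption that the "value" at the end of a key path is the element's text content, and the function name `satisfies_key` and the sample documents are hypothetical.

```python
import xml.etree.ElementTree as ET


def satisfies_key(root, target_path, key_paths):
    """Check the key (target_path, key_paths) against the tree under root."""
    seen = set()
    for node in root.findall(target_path):
        values = []
        for p in key_paths:
            hits = node.findall(p)
            if len(hits) != 1:
                # Condition 1: every key path must exist uniquely.
                return False
            values.append((hits[0].text or "").strip())
        values = tuple(values)
        if values in seen:
            # Condition 2: two distinct nodes agree on all key paths.
            return False
        seen.add(values)
    return True


doc = ET.fromstring(
    "<db><dept><name>finance</name></dept>"
    "<dept><name>marketing</name></dept></db>"
)
dup = ET.fromstring(
    "<db><dept><name>finance</name></dept>"
    "<dept><name>finance</name></dept></db>"
)

ok = satisfies_key(doc, "dept", ["name"])
violated = satisfies_key(dup, "dept", ["name"])
```

With an empty set of key paths, every target node has the same (empty) key value, so the check passes only when there is at most one target node.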
Relative keys
• Since XML is hierarchical, we also need to specify keys relative to a context node
  • (Q, (Q’, {P1, …, Pk}))
• Examples
  • (/, (db, {})). There is at most one db element below the root.
  • (/db, (dept, {name})). Every dept node within a db node can be uniquely identified by the contents of its name subelement.
  • (/db/dept, (emp, {fn, ln})). Every emp node within a dept node along the path /db/dept can be uniquely identified by the contents of its fn and ln subelements.
  • (/db/dept/emp, (sal, {})). There is at most one sal subelement under each emp node along the path /db/dept/emp.
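The four example keys can be checked against a small document as a sketch. The code assumes that key values are compared by text content, and it wraps the document in a synthetic `root` element so that the path `.` can stand in for the document root `/`; the function names and the sample data are hypothetical.

```python
import xml.etree.ElementTree as ET


def key_values(node, key_paths):
    """Tuple of text values at the key paths, or None if a path is missing or repeated."""
    vals = []
    for p in key_paths:
        hits = node.findall(p)
        if len(hits) != 1:
            return None
        vals.append((hits[0].text or "").strip())
    return tuple(vals)


def satisfies_relative_key(root, context_path, target_path, key_paths):
    """Check (context_path, (target_path, key_paths)) separately in each context node."""
    for ctx in root.findall(context_path):
        seen = set()
        for node in ctx.findall(target_path):
            vals = key_values(node, key_paths)
            if vals is None or vals in seen:
                return False
            seen.add(vals)
    return True


db = ET.fromstring("""
<db>
  <dept><name>finance</name>
    <emp><fn>John</fn><ln>Doe</ln><sal>90K</sal></emp>
    <emp><fn>Jane</fn><ln>Doe</ln><sal>95K</sal></emp>
  </dept>
  <dept><name>marketing</name>
    <emp><fn>John</fn><ln>Doe</ln><sal>80K</sal></emp>
  </dept>
</db>
""")
top = ET.Element("root")  # synthetic stand-in for the document root "/"
top.append(db)

relative_keys = [
    (".", "db", []),
    ("db", "dept", ["name"]),
    ("db/dept", "emp", ["fn", "ln"]),
    ("db/dept/emp", "sal", []),
]
all_hold = all(satisfies_relative_key(top, q, t, ps) for q, t, ps in relative_keys)

# The same emp key stated absolutely fails: John Doe appears in two departments.
absolute_emp = satisfies_relative_key(top, ".", "db/dept/emp", ["fn", "ln"])
```

The last check shows why the context matters: the names identify an employee only within a department, not across the whole document.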
Archiver architecture
• Assumptions:
  • Every key defined for a node is relative to its parent, e.g. the key for emp is relative to its parent dept node
  • Frontier nodes identify unkeyed portions of the document
Nested merge
• Recursively merge nodes in the incoming version (D) to nodes in the archive (A) that have the same key value, starting from the root.
• When a node y in D is merged with a node x from A, the timestamp of x is augmented with i (the new version number), and subtrees are recursively merged.
• Nodes in D that do not have matching nodes in A are simply added with i as the timestamp
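The merge steps above can be sketched on a simplified model in which each version is a nested dictionary keyed by key value, and each archive node carries a set of version numbers as its timestamp. The dictionary layout, the labels such as `emp[Joe Doe]`, and the function names are illustrative assumptions, not the paper's data structures.

```python
def new_node():
    return {"ts": set(), "children": {}}


def nested_merge(archive, version, i):
    """Merge version i (a nested dict keyed by key value) into the archive node."""
    archive["ts"].add(i)  # augment the timestamp with the new version number
    for key, subtree in version.items():
        # Same key value means same node: merge into it; otherwise add a new node.
        child = archive["children"].setdefault(key, new_node())
        nested_merge(child, subtree, i)
    return archive


def render(node, label="root", parent_ts=None, depth=0):
    """List the archive, showing a timestamp only where it differs from the parent's."""
    ts = "" if node["ts"] == parent_ts else f" t={sorted(node['ts'])}"
    lines = ["  " * depth + f"{label}{ts}"]
    for key in sorted(node["children"]):
        lines += render(node["children"][key], key, node["ts"], depth + 1)
    return lines


v1 = {"db": {"dept[finance]": {"emp[Joe Doe]": {"sal": {"90K": {}}}}}}
v2 = {"db": {"dept[finance]": {
    "emp[Joe Doe]": {"sal": {"95K": {}}},
    "emp[Jane Roe]": {"sal": {"80K": {}}},
}}}

archive = new_node()
nested_merge(archive, v1, 1)
nested_merge(archive, v2, 2)
lines = render(archive)
```

Archive nodes that are absent from the incoming version are simply not visited, so their timestamps stop growing; that is how a deletion is recorded. The `render` helper illustrates timestamp inheritance: only nodes whose timestamp differs from the parent's need one stored explicitly.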
Further compaction under frontier node
Querying the archive
• What is the database at t=1?
• When did Joe Doe get a salary raise?
• What were the changes to the database between t=1 and t=3?
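These three queries can be sketched against a tiny hand-built archive in which each node carries the set of versions it belongs to. The archive contents here are invented for illustration (the figure the slide refers to is not part of this transcript), and the helper names are hypothetical.

```python
def node(ts, children=None):
    return {"ts": set(ts), "children": children or {}}


# Hypothetical archive covering versions 1..3.
archive = node({1, 2, 3}, {
    "emp[Joe Doe]": node({1, 2, 3}, {
        "sal": node({1, 2, 3}, {"90K": node({1, 2}), "95K": node({3})}),
    }),
    "emp[Ann Lee]": node({2, 3}, {
        "sal": node({2, 3}, {"80K": node({2, 3})}),
    }),
})


def snapshot_at(n, t):
    """Rebuild version t by keeping only the nodes whose timestamp contains t."""
    return {k: snapshot_at(c, t) for k, c in n["children"].items() if t in c["ts"]}


def first_seen(n, path):
    """Earliest version in which the node at the given key path exists."""
    for key in path:
        n = n["children"][key]
    return min(n["ts"])


def changes_between(n, t1, t2, prefix=()):
    """Key paths present at t2 but not t1 (added), and at t1 but not t2 (removed)."""
    added, removed = [], []
    for k, c in n["children"].items():
        path = prefix + (k,)
        if t1 in c["ts"] and t2 in c["ts"]:
            a, r = changes_between(c, t1, t2, path)
            added += a
            removed += r
        elif t2 in c["ts"]:
            added.append(path)
        elif t1 in c["ts"]:
            removed.append(path)
    return added, removed


db_at_1 = snapshot_at(archive, 1)
raise_version = first_seen(archive, ["emp[Joe Doe]", "sal", "95K"])
added, removed = changes_between(archive, 1, 3)
```

All three queries are answered by a single traversal of the merged archive, without replaying a log or comparing stored copies of whole versions.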
Conclusions
• Versioning is important for many different applications
• While techniques are similar between different representations (e.g. files, relations, XML, RDF), differences in assumptions can be used to build more efficient solutions.
  • And the operations (e.g. queries) you wish to perform are important too!