![Page 1: Transparently Gathering Provenance with Provenance Aware Condor Christine Reilly and Jeffrey Naughton Department of Computer Sciences University of Wisconsin](https://reader035.vdocuments.us/reader035/viewer/2022062517/56649f305503460f94c4aacb/html5/thumbnails/1.jpg)
Transparently Gathering Provenance with Provenance Aware Condor
Christine Reilly and Jeffrey Naughton
Department of Computer Sciences
University of Wisconsin - Madison
TaPP ’09, February 23, 2009
Motivations for Our Work
• Scientific Computing
• Grid Computing
• Condor job scheduling system
Motivation: Scientific Computing
• What input data was used to produce this output data?
• Example:
  – Blast sequence DB is updated periodically.
  – Did my last computation use the latest version of the Blast DB?
Motivation: Scientific Computing
• Need to find the root of anomalous results.
• Example:
  – Multiple members of a research group can update the code for the simulation.
  – Did two researchers use the same version of the executable?
Motivation: Grid Computing
• What grid resources did my computation use?
• Example:
  – A machine in the grid has a hardware problem (bad DIMM, corrupt disk).
  – Did any of my computations use the bad machine?
Motivation: Condor
• Condor:
  – Distributed computing system.
  – Runs jobs from a wide range of applications and fields of study.
• Quill: Captures operational information exposed by Condor.
• Could Quill be used for provenance?
  – Users would get provenance “for free”.
Provenance System Goals
• Generic: Can be used by many different applications.
• Transparent: Users don’t need to alter their applications.
• Can we use Condor as the execution system for this type of provenance system?
• Can we do this with minimal impact on Condor developers?
Our Model of Provenance
• File level granularity.
• The provenance of a file is:
  – All files involved in its creation.
  – Job execution environment.
[Diagram labels: input, exe, output, Execution Environment]
Outline of Talk
• Motivation and Introduction
• PAC - Provenance Aware Condor
• Is PAC Practical? (Storage requirements and overhead)
• DBLife: Benefits and Limitations of PAC
• Conclusions
Provenance Aware Condor
• Three parts: Condor, Quill, FileTrace.
• Condor is the job execution system.
• Quill gathers operational information from Condor.
• FileTrace gathers information about files used by Condor jobs.
• Provenance is obtained by querying the Quill database.
What is Condor?
• Provides a simple interface to a cluster of machines.
  – User gives computing jobs to Condor.
  – Condor finds an available machine that meets the job requirements.
  – Condor ensures that the job completes execution, even in the presence of failures.
• Used for many different applications:
  – Physics, Biology, Chemistry, Computer Science, Engineering applications.
  – Academic research, national labs, industry.
What is Condor? (part 2)
• Research project at UW, but also resembles commercial software.
  – 35 faculty, full-time staff, students.
  – Used at ~2000 locations world-wide.
  – Running on more than 250,000 machines.
• Annual user conference attracts hundreds of attendees.
Condor Overview
[Diagram: Condor architecture. Components: User’s Machine, User’s Local Files, Submit Machine, Central Manager, Execute Machine, Shared File System. Labeled flows: Job Specs; Exe, In Files; Executable, Input Files; Output Files; Job Requirements; Machine Resources; Match Info.]
What is Quill?
• Gathers operational data from Condor.
  – Improves performance of users’ inquiries about their jobs.
  – Information about Condor operations is stored in an RDBMS.
• Created by group of database researchers (including us).
• Ships with Condor.
Quill’s Design Requirements
• Expose Condor operational data for querying through a DBMS interface.
• Cannot affect how Condor operates:
  – No changes to the existing Condor code.
  – No change to how Condor is used.
• Failures in Quill cannot cause Condor to fail.
Quill Overview
[Diagram: the Condor architecture as seen by Quill. Components: User’s Machine, User’s Local Files, Submit Machine, Central Manager, Execute Machine, Shared File System. Labeled flows: Job Specs; Exe, In Files; Executable, Input Files; Output Files; Job Requirements; Machine Resources; Match Info.]
File Access in Condor
• Three methods of accessing files:
  – Remote system calls
  – Condor File Transfer
  – Shared file system
• Quill’s information about files:
  – Detailed information about File Transfers
  – Limited information about some other files
  – No guarantee that it detects all files
• Problem: Because Condor does not track all file access, Quill may miss file information.
• FileTrace gathers information about all files.
FileTrace
• Modification of UNIX strace.
• Transparent to Condor.
• For each file open or close system call FileTrace gathers:
  – Condor Job Id.
  – File information: pathname, last modified time, checksum, size.
  – File access information: activity type, open flags, file pointer.
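As a rough illustration of the per-file record listed above, the following Python sketch collects the same fields for one file. It is not FileTrace itself (which is a modified strace): the `FileRecord` class and `describe_file` helper are hypothetical names, SHA-1 is an assumed checksum algorithm because the slides do not name one, and the file pointer field is omitted.

```python
import hashlib
import os
import tempfile
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class FileRecord:
    """One record of file metadata, mirroring the fields listed on the slide."""
    job_id: str
    pathname: str
    last_modified: datetime
    checksum: str
    size: int
    activity_type: str  # e.g. "open" or "close"
    open_flags: str     # e.g. "O_RDONLY"; empty when not applicable


def describe_file(job_id, pathname, activity_type, open_flags=""):
    """Collect metadata for one file access (hypothetical helper)."""
    st = os.stat(pathname)
    # Assumption: SHA-1; the slides do not say which checksum FileTrace uses.
    digest = hashlib.sha1()
    with open(pathname, "rb") as f:
        # Read in 1 MiB chunks so large files are not loaded into memory at once.
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return FileRecord(
        job_id=job_id,
        pathname=pathname,
        last_modified=datetime.fromtimestamp(st.st_mtime, tz=timezone.utc),
        checksum=digest.hexdigest().upper(),
        size=st.st_size,
        activity_type=activity_type,
        open_flags=open_flags,
    )


if __name__ == "__main__":
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"example input")
    print(describe_file("my_job_id", tmp.name, "open", "O_RDONLY"))
    os.unlink(tmp.name)
```

Recording the checksum alongside the last modified time is what lets later queries tell whether two jobs saw the same version of a file.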
Provenance in PAC
• Provenance of a file is:
  – All files involved in its creation: Executable, Input files, Libraries and system files.
  – Job execution environment: when, on what machine.
• PAC has no control over files in the user’s file system.
Provenance Queries in PAC
What files were used by this job?
SELECT pathname, activity_type, open_flags, last_modified, size, checksum
FROM filetrace
WHERE globaljobid = 'my_job_id';
| pathname | activity type | open flags | last modified | size (bytes) | checksum |
|---|---|---|---|---|---|
| program.exe | exe | | 2008-10-06 17:52:16-05 | 687 | 8FC8316B8930835C9085E953F2D36745C2877346 |
| /path/to/input.xml | open | O_RDONLY | 2008-10-06 17:52:28-05 | 9554599 | 57B7A53ACF390B1928A6C4B6E935DD9F2CBE9F46 |
| /path/to/output.xml | close | | 2008-10-06 17:55:19-05 | 2E+07 | 9D32885FDCD9171E44F9302BB0141DFC1FC6B772 |
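The query above can be tried end to end with a small Python sketch. Everything about the setup is an assumption made for illustration: SQLite stands in for the RDBMS behind Quill, the column types are guesses (only the column names come from the slide), `files_used_by` is a hypothetical helper, and the row for `other_job_id` is invented to show that the filter works.

```python
import sqlite3

# Assumption: SQLite stands in for Quill's RDBMS; column types are guesses.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE filetrace (
        globaljobid   TEXT,
        pathname      TEXT,
        activity_type TEXT,
        open_flags    TEXT,
        last_modified TEXT,
        size          INTEGER,
        checksum      TEXT
    )
    """
)
conn.executemany(
    "INSERT INTO filetrace VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        # The three rows shown on the slide.
        ("my_job_id", "program.exe", "exe", None,
         "2008-10-06 17:52:16-05", 687,
         "8FC8316B8930835C9085E953F2D36745C2877346"),
        ("my_job_id", "/path/to/input.xml", "open", "O_RDONLY",
         "2008-10-06 17:52:28-05", 9554599,
         "57B7A53ACF390B1928A6C4B6E935DD9F2CBE9F46"),
        # The slide shows the size only in rounded form (2E+07).
        ("my_job_id", "/path/to/output.xml", "close", None,
         "2008-10-06 17:55:19-05", 20000000,
         "9D32885FDCD9171E44F9302BB0141DFC1FC6B772"),
        # Hypothetical row for a different job; not from the slides.
        ("other_job_id", "/path/to/other.xml", "open", "O_RDONLY",
         "2008-10-07 09:00:00-05", 1024, "0" * 40),
    ],
)


def files_used_by(conn, job_id):
    """Run the slide's query for one job, with the job id passed as a parameter."""
    cur = conn.execute(
        "SELECT pathname, activity_type, open_flags, last_modified, size, checksum "
        "FROM filetrace WHERE globaljobid = ?",
        (job_id,),
    )
    return cur.fetchall()


for row in files_used_by(conn, "my_job_id"):
    print(row)
```

Passing the job id as a bound parameter rather than pasting it into the SQL string avoids quoting problems when ids contain special characters.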
Provenance Queries in PAC
More questions PAC can answer:
• Were any of my files created on the bad machine?
• Was this output created using the current versions of the input and executable files?
• Did any input files change between two runs of the application?
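The last question lends itself to a checksum comparison. The sketch below is one possible way to phrase it, not a query from the talk: it assumes a `filetrace` table with the column names used in the earlier query, uses SQLite as a stand-in database, and the job ids, file names, and checksums are made up for the example.

```python
import sqlite3


def changed_inputs(conn, job_a, job_b):
    """Return (pathname, checksum_a, checksum_b) for files opened by both
    jobs whose recorded checksums differ."""
    cur = conn.execute(
        """
        SELECT DISTINCT a.pathname, a.checksum, b.checksum
        FROM filetrace AS a
        JOIN filetrace AS b ON a.pathname = b.pathname
        WHERE a.globaljobid = ?
          AND b.globaljobid = ?
          AND a.activity_type = 'open'
          AND b.activity_type = 'open'
          AND a.checksum <> b.checksum
        ORDER BY a.pathname
        """,
        (job_a, job_b),
    )
    return cur.fetchall()


# Hypothetical data: two runs that share one unchanged and one changed input.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE filetrace "
    "(globaljobid TEXT, pathname TEXT, activity_type TEXT, checksum TEXT)"
)
conn.executemany(
    "INSERT INTO filetrace VALUES (?, ?, ?, ?)",
    [
        ("run_1", "/path/to/input.xml", "open", "AAAA"),
        ("run_1", "/path/to/params.cfg", "open", "CCCC"),
        ("run_2", "/path/to/input.xml", "open", "BBBB"),
        ("run_2", "/path/to/params.cfg", "open", "CCCC"),
    ],
)

print(changed_inputs(conn, "run_1", "run_2"))
# prints [('/path/to/input.xml', 'AAAA', 'BBBB')]
```

Questions about the machine a job ran on would need Quill's job and machine tables, whose layout the slides do not show, so they are left out here.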
Is PAC Practical?
• Users will not tolerate PAC if:
  – Data storage requirements are too onerous.
  – Computational overhead is significant.
Estimate Table Growth
• Use the run time and number of files used by seven scientific applications (BLAST, IBIS, CMS, Nautilus, Messkit Hartree-Fock, AMANDA, SETI@home).
• Simulate running each application constantly for a year on a 1000 machine cluster.
• How big is the FileTrace table?
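The estimate described above comes down to simple arithmetic, sketched here in Python. The slides do not give the row size or the per-application inputs, so `bytes_per_row`, `records_per_file`, and the example run time and file count are all hypothetical placeholders; only the one-year duration and the 1000-machine cluster come from the slide.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60


def estimate_table_bytes(run_time_s, files_per_job, machines=1000,
                         bytes_per_row=200, records_per_file=2):
    """Estimate FileTrace table size after one year of back-to-back jobs.

    Assumptions (not from the slides): each file access produces
    `records_per_file` rows (for example one open and one close), and each
    row occupies `bytes_per_row` bytes.
    """
    jobs_per_machine = SECONDS_PER_YEAR // run_time_s
    rows = machines * jobs_per_machine * files_per_job * records_per_file
    return rows * bytes_per_row


# Illustrative inputs only: a 20-minute job that touches 10 files.
size_bytes = estimate_table_bytes(run_time_s=20 * 60, files_per_job=10)
print(f"{size_bytes / 1e9:.1f} GB")  # prints 105.1 GB
```

The estimate grows linearly with the number of files per job and inversely with run time, so short jobs that touch many files dominate the table.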
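Size of FileTrace Table: Application Continuously Run for 1 Year on 1000 Machines
[Figure: bar chart. x-axis: IBIS, SETI, CMS, Nautilus, Amanda, HF, Blast. y-axis: Log Table Size (GB), from 0 to 3.5. Bar labels as extracted (their pairing with the applications is not recoverable from the transcript): 1.15 TB, 900 GB, 9 GB, 30 GB, 41 GB, 190 GB, 540 GB.]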
Test Cluster for Overhead
• Machine Specs:
  – Pentium 2.40 GHz Core 2 Duo
  – 4GB RAM
  – Two 250GB SATA-I hard disks.
• Condor Cluster:
  – Central Manager
  – Submit Machine
  – Database Machine
  – 10 execute machines
Overhead of PAC
• Use synthetic program.
  – Writes random numbers to files.
  – Can vary the number of files generated and run time.
• 4 sets of jobs with run time and number of files based on the scientific programs from the size experiment.
  – Run time per job: 5 or 20 minutes.
  – Files per job: 10 or 300.
Overhead of PAC
• Two Clusters: PAC and Condor without Quill or FileTrace.
• For each set of jobs, submit as many jobs as will take ~12 hours to complete.
• No significant difference in average time per job or total run time.
DBLife
• Community Information Management (CIM)
• Creates and maintains an ER graph of the database research community.
• Gathers data by crawling the web.
• Interesting example for PAC:
  – Large, complex workflow.
  – Accesses many files.
Overview of DBLife
[Diagram labels: List of Sources; Domain Knowledge Provided by Community Expert; Data Page Layer; Mention Layer; ER Layer (entities shown: Conference, University, Paper, Company, Joe, Jane); DBLife Home; CIM Application.]
DBLife Front Page
DBLife Person Homepage
Questions PAC Can Answer
• In general - questions developers and system administrators would ask.
• Examples:
  – Is this the version of the program that was used to create the current data set?
  – Was this machine used to create any part of the DBLife data set?
  – Does the DBLife ER model reflect the recent changes on my web page?
Questions PAC Cannot Answer
• In general - questions that require knowledge of the semantics of DBLife.
• Examples:
  – Why does the DBLife portal show that person X is affiliated with institution Y?
  – Why doesn’t my page on the DBLife portal include that I presented a paper at this workshop?
Problem: Large FileTrace Table
• Ran DBLife with 10 crawl start points:
  – FileTrace table: 100 MB
• Full scale run uses 1000 start points:
  – FileTrace table est.: 10 GB
• Run DBLife daily for 1 year:
  – FileTrace table est.: 3.7 TB
• Possibilities for reducing table size:
  – Could be smarter about how data is stored.
  – Table compression.
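The estimates on this slide are consistent with scaling the measured 100 MB linearly, as the short Python check below shows. Linear growth with the number of start points, decimal units (1 GB = 1000 MB), and a 365-day year are assumptions made here to reproduce the figures; `scale_table_mb` is a hypothetical helper name.

```python
def scale_table_mb(measured_mb, measured_start_points, target_start_points):
    """Scale a measured table size linearly with the number of crawl start points."""
    return measured_mb * target_start_points // measured_start_points


# Measured: 10 start points produced a 100 MB FileTrace table.
full_run_mb = scale_table_mb(100, 10, 1000)
print(full_run_mb / 1000, "GB per full-scale run")   # prints 10.0 GB per full-scale run

# One full-scale run per day for a year.
yearly_mb = full_run_mb * 365
print(yearly_mb / 1_000_000, "TB per year")          # prints 3.65 TB per year
```

The yearly figure of 3.65 TB matches the slide's 3.7 TB after rounding.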
Limitations of PAC
• File level is coarse granularity.
• Files are not under PAC’s management.
• Workflows are not explicitly recorded.
• Doesn’t know about the semantics of an application.
Contributions of PAC
• Many different applications can use PAC for provenance.
• Has little impact on Condor.
• Could be combined with other systems to yield a stronger provenance system.
  – Workflow management system.
  – File management system.
Future Work
• Integrate PAC with an Xlog version of DBLife.
  – Xlog is similar to Datalog.
  – Record information when Xlog rule is triggered.
  – Provides fine grained provenance.
  – Can answer questions that depend on the semantics of the application.
Acknowledgements
• Supported in part by National Science Foundation Award SCI-0515491.
• Thank you CondorDB, DBLife, and Condor groups!