the vault data manager derek hower 2/10/2011. summary this talk: – is: a conceptual overview &...
The Vault Data Manager
Derek Hower
2/10/2011
Summary
• This talk:
  – Is: A conceptual overview & discussion
  – Is not: A Vault tutorial
  – Is not: Polished. Interruptions will hide that.
• Vault unifies:
  – Data storage
  – Data analysis
  – Job management
• Features:
  – Designed for flexibility & sharing
  – Should be sufficient to meet NSF guidelines
• Proposal (open to discussion):
  – The group should phase in Vault
Outline
• Elephants
• Motivation/Goals
• Vault Overview
• Discussion
• NSF
• Status
An Aside on Ruby
• Vault is written (mostly) in Ruby
  – Don’t have to use it
    • Has a command line & web interface
  – But…
    • Not all operations are accessible from the command line
    • You need to write submission/analysis scripts anyway
• Will GEM5 stand for this “ruby” thing?
  – The simulator-side component is in C
• Want it in Python?
  – I’m available for consultation
So you built a DBMS? (a.k.a. Dear Spyros,)
• Vault does have elements of a DBMS
  – Serialized commit, file storage, etc.
• But is much more
  – Interface, job management, repository, etc.
• Why not use a DBMS under the hood?
  – I think they are clumsy to work with
  – Some operations don’t map well (job stats, permissions)
Outline
• Elephants
• Motivation/Goals
• Vault Overview
• Discussion
• NSF
• Status
Motivation
• There is no unified data management plan
  – Collaborating can be a pain
  – Interpreting data can be a pain
  – Unstructured data is error prone
    • Custom parsers for every experiment, etc.
• Loosely unified job management
  – Condor, but everyone has their own submission scripts
• Some people (me) need enforced organization
  – Vault was made for me. Maybe you’ll like it too.
Goals
• Repeatability
  – Don’t do anything until you know you can do it again
• Flexibility
  – Multiple tools
  – Storage – Migration & compression
  – Scheduling
• Promote Collaboration
  – Share data, actively work together
  – Protect data with permissions
• Data Integrity
A Note on Storage
• Why focus on storage reduction/management?
  – Aren’t stats just text files?
• Case Study: Rocks
  – Typical job:
    • Stat file: 170K
    • Stdout: 743
    • Stderr: 27K
    • Config: 17K
  – Total: 215K/job
  – 215K * 2000 jobs = 430M of text per experiment!!
• Key: Most of the text is redundant
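The per-experiment estimate above can be reproduced with a few lines of Ruby. This is only a worked version of the slide's arithmetic, assuming that "K" means 1000 bytes and that the bare "743" for stdout is a byte count; the constant and method names are made up for the example. The exact sums come out slightly under the slide's rounded 215K and 430M.

```ruby
# Per-job output sizes from the Rocks case study, in bytes.
# Assumptions: "K" is 1000 bytes and the bare "743" is a byte count.
K = 1_000
JOB_FILES = {
  stat:   170 * K,
  stdout: 743,
  stderr: 27 * K,
  config: 17 * K
}.freeze

# Total text produced by a single job.
def bytes_per_job(files)
  files.values.sum
end

# Total text produced by an experiment made of `jobs` such jobs.
def bytes_per_experiment(files, jobs)
  bytes_per_job(files) * jobs
end

per_job = bytes_per_job(JOB_FILES)
per_experiment = bytes_per_experiment(JOB_FILES, 2000)
puts "per job: #{per_job} bytes"
puts "per experiment: #{per_experiment} bytes"
```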
Outline
• Elephants
• Motivation/Goals
• Vault Overview
• Discussion
• NSF
• Status
What is Vault?
• Demo Time!
Features
• Search
• Consistency
• Repeatability
• Flexible permissions
• Multiple views
• Flexible storage options
• Documentation
• Result parsing tools
• Modular software architecture
• Annotations
Vault Object Organization
[Diagram: boxes labeled Repository, Experiment, Job Scaffold, Configuration, Run, Apparatus, Scheduler, Job, Stat, MiscOut]
Vault Repositories
• Three components:
  – One Metafile
  – One or more Storage Directories
  – One or more Sandbox Directories
• Access managed by filesystem
• To share or not to share?
  – + Increase collaboration
  – - Hard to manage storage needs
  – - Limited data protection
  – Vault’s answer: repository linking
Repository Linking
• Derek’s Repository: ~drh5/vault.storage (Perm: 744)
• Polina’s Repository: ~pdudnik/vault.storage (Perm: 744)
• Calvin Repository: …/projects/calvin/vault.storage (Perm: 774)
Implementation Note
• Vault uses a flat storage scheme
  – Every object is a “blob” identified by a hash of its contents
• Benefits
  – Objects can be stored anywhere
    • Repository Linking is easy
    • Storage management is flexible
  – Identical files are stored once
• Hash Collision?
  – Chance is order 1:2^80. And it’s good enough for git.
~/vault.storage
5CA…1AB 1E0…BAD CAF…EBA BE0…111
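The flat storage scheme above can be sketched as a small content-addressed store. This is an illustration under assumptions, not Vault's code: the class `BlobStore`, its methods, the `linked:` option, and the choice of SHA-1 as the hash are all hypothetical. It shows the two benefits from the slide: identical contents are stored once, and a blob can be found in any linked storage directory because its name does not depend on where it lives.

```ruby
require "digest"
require "fileutils"
require "tmpdir"

# Hypothetical content-addressed blob store (illustration only, not Vault's code).
class BlobStore
  # dir: writable storage directory; linked: other stores consulted on reads.
  def initialize(dir, linked: [])
    @dir = dir
    @linked = linked
    FileUtils.mkdir_p(@dir)
  end

  # Write contents under the hex digest of the contents and return the digest.
  # Identical contents map to the same file name, so they are stored once.
  def put(contents)
    id = Digest::SHA1.hexdigest(contents) # SHA-1 is an assumption
    path = File.join(@dir, id)
    File.binwrite(path, contents) unless File.exist?(path)
    id
  end

  # Read a blob by id, falling back to linked stores when it is not local.
  # Returns nil when no store has the blob.
  def get(id)
    path = File.join(@dir, id)
    return File.binread(path) if File.exist?(path)

    @linked.each do |store|
      found = store.get(id)
      return found if found
    end
    nil
  end

  # Number of blobs held in this store's own directory.
  def blob_count
    Dir.children(@dir).size
  end
end

mine = BlobStore.new(Dir.mktmpdir)
shared = BlobStore.new(Dir.mktmpdir, linked: [mine])
first = mine.put("cores: 4\n")
second = mine.put("cores: 4\n")
puts first == second
puts mine.blob_count
puts shared.get(first)
```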
Experiments
• Complete description of an experiment
  – Copy of the tool (apparatus)
  – Copy of all inputs
  – Copy of commands
• Becomes immutable once run
  – Exception: annotations
• Key to repeatability
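The "immutable once run, except annotations" rule above can be sketched as follows. The `Experiment` class and its method names are hypothetical and only illustrate the rule; Vault's real objects may look quite different. Note that `freeze` is shallow, so a real implementation would need to protect nested data as well.

```ruby
# Hypothetical model of an experiment record (illustration only, not Vault's API).
class Experiment
  attr_reader :apparatus, :inputs, :commands, :annotations

  def initialize(apparatus:, inputs:, commands:)
    @apparatus = apparatus
    @inputs = inputs
    @commands = commands
    @annotations = []
    @ran = false
  end

  # Commands may only be added before the experiment has run.
  def add_command(command)
    raise "experiment is immutable once run" if @ran
    @commands << command
  end

  # Mark the experiment as run and freeze its description.
  def run!
    @ran = true
    [@apparatus, @inputs, @commands].each(&:freeze)
    self
  end

  # Annotations are the one part that stays writable after a run.
  def annotate(note)
    @annotations << note
  end
end

exp = Experiment.new(apparatus: "toysim", inputs: ["workload.cfg"], commands: ["run"])
exp.add_command("collect-stats")
exp.run!
exp.annotate("rerun after fixing the cache size")
puts exp.commands.frozen?
puts exp.annotations.inspect
```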
Apparatus
• Describes how to control a tool
  – SCM control
  – Building
  – Running
• Allows Vault to be used with many different tools
• Apparati are Vault plugins
  – Ruby code
  – Saved with the experiment
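An apparatus plugin along the lines described above might look like the sketch below. The base class, the three method names, and the `ToySimApparatus` example (an imaginary simulator kept in git and built with make) are assumptions made for illustration; the slides do not show Vault's actual plugin interface. The sketch only returns command lines and does not execute anything.

```ruby
# Hypothetical apparatus plugin interface. The method names are assumptions
# for illustration, not Vault's API.
class Apparatus
  # SCM control: command that checks out a fixed revision of the tool.
  def checkout_command(revision)
    raise NotImplementedError, "#{self.class} must define checkout_command"
  end

  # Building: command that builds the tool.
  def build_command
    raise NotImplementedError, "#{self.class} must define build_command"
  end

  # Running: command that runs one job with the given arguments.
  def run_command(job_args)
    raise NotImplementedError, "#{self.class} must define run_command"
  end

  # Full recipe for one job: check out, build, then run.
  def recipe(revision, job_args)
    [checkout_command(revision), build_command, run_command(job_args)]
  end
end

# Example plugin for an imaginary simulator kept in git and built with make.
class ToySimApparatus < Apparatus
  def checkout_command(revision)
    ["git", "checkout", revision]
  end

  def build_command
    ["make", "-j4"]
  end

  def run_command(job_args)
    ["./toysim", *job_args]
  end
end

recipe = ToySimApparatus.new.recipe("abc123", ["--cores", "4"])
recipe.each { |command| puts command.join(" ") }
```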
Scheduler
• Controls where and when jobs are run
• Like Apparati, Schedulers are Vault plugins
• Two existing (more possible):
  – SerialScheduler
  – MultifacetCondorScheduler
Run
• Container for a run of an experiment
• Experiments may be run multiple times
• Contains:
  – Scheduler, Jobs
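The idea that a scheduler is a swappable plugin deciding where and when jobs run can be sketched with two toy classes that share one method. These are not Vault's SerialScheduler or MultifacetCondorScheduler; the class names, the `run_all` method, and the use of threads as a stand-in for a cluster are assumptions for illustration.

```ruby
# Hypothetical scheduler plugins; class and method names are illustrative only.
class ToySerialScheduler
  # Run jobs one at a time, in submission order, and collect their results.
  def run_all(jobs)
    jobs.map(&:call)
  end
end

class ToyThreadedScheduler
  # Run every job in its own thread; results come back in submission order.
  def run_all(jobs)
    jobs.map { |job| Thread.new { job.call } }.map(&:value)
  end
end

# Jobs are modeled as callables; either scheduler can run the same list.
jobs = [1, 2, 3].map { |n| -> { n * n } }
puts ToySerialScheduler.new.run_all(jobs).inspect
puts ToyThreadedScheduler.new.run_all(jobs).inspect
```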
Job Scaffold
• Describes how a job is configured & controlled
• Elements:
  – Configuration
  – Command line
  – Repetitions
Configuration
• Can be:
  – A standard Vault configuration
    • <key>:<value> list
  – A non-standard text file
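A `<key>:<value>` list like the standard configuration mentioned above could be read with a parser along these lines. The slides do not specify the exact file layout, so this sketch assumes one pair per line, splits on the first colon, and ignores blank lines; `parse_config` is a hypothetical helper, not part of Vault.

```ruby
# Parse a "<key>:<value>" list into a Hash.
# Assumed layout: one pair per line, blank lines ignored, split on the first ':'.
def parse_config(text)
  text.each_line.with_object({}) do |line, config|
    line = line.strip
    next if line.empty?

    key, value = line.split(":", 2)
    raise ArgumentError, "missing ':' in #{line.inspect}" if value.nil?

    config[key.strip] = value.strip
  end
end

config = parse_config("cores: 4\nprotocol: MESI\n")
config.each { |key, value| puts "#{key} = #{value}" }
```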
Stats
• All Vault tools *must* use the Vault stat infrastructure
• C/C++ library
  – Collection of macros
    • vs_new_signed_scalar(name, desc, data_ptr)
    • vs_new_signed_sarray(name, desc, size, array_ptr)
    • etc.
  – Below tool stat managers (e.g., GEM5 stat class)
  – Includes stat server for real-time updates
Stat File Format
• Produces two files
  – Header
    • XML description of stats
  – Data
    • Binary data file
• Most jobs from same tool produce identical headers
  – Vault’s storage stores one copy
• Data files are small
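The header/data split above can be illustrated with a toy encoding. The XML element names, the fixed 64-bit little-endian layout, and the helper names below are assumptions for this sketch, not Vault's real file format. It shows why the split helps: two jobs from the same tool produce byte-identical headers (so a content-hashed store keeps one copy), and the per-job data file holds only the values.

```ruby
require "digest"

# Hypothetical header: an XML description naming each stat (format is assumed).
def stat_header(names)
  fields = names.map { |name| %(  <stat name="#{name}" type="int64"/>) }
  "<stats>\n#{fields.join("\n")}\n</stats>\n"
end

# Hypothetical data file: the values only, as little-endian signed 64-bit integers.
def stat_data(values)
  values.pack("q<*")
end

# Recombine a list of stat names with a data file.
def read_stats(names, data)
  names.zip(data.unpack("q<*")).to_h
end

names = %w[insns cycles]
job_a = stat_data([1000, 2500])
job_b = stat_data([1200, 2900])
puts Digest::SHA1.hexdigest(stat_header(names)) # same for both jobs
puts job_a.bytesize
puts read_stats(names, job_b).inspect
```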
Views
• Two (three?) views
  – Command line
  – Web server
  – Access through Ruby
• PIs: only need to know one command
  – vault serve
• Demo to follow
Vault Organization
[Diagram: boxes labeled Repository, Experiment, Job Scaffold, Job Mold, Configuration, Run, Apparatus, Scheduler, Job, Stat, MiscOut]
Data Analysis
• Unified data storage/access leads to common analysis tools/techniques
• Vault comes with a few neat parsing helpers
  – e.g., in Ruby:
    insns = repo.find(:config => some_config).insns.arith_mean
  – Finds all jobs matching config, gets the stat “insns” from each, and returns the arithmetic mean of all of them
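To make the chained one-liner above concrete, here is a toy model that supports the same call shape. `Job`, `JobSet`, `StatList`, and `Repo` are stand-ins written for this example; Vault's real classes and the way it resolves stat names may differ. The sketch treats an unknown method name on a job set as a stat name, which is one plausible way such a helper could work.

```ruby
# Toy stand-ins for the objects in the one-liner; Vault's real classes may differ.
Job = Struct.new(:config, :stats)

# A list of stat values with a mean helper.
class StatList < Array
  def arith_mean
    return nil if empty?
    sum.to_f / size
  end
end

# A set of jobs; an unknown method name is treated as a stat name.
class JobSet
  def initialize(jobs)
    @jobs = jobs
  end

  def method_missing(name, *args)
    return super unless args.empty? && stat?(name)
    StatList.new(@jobs.map { |job| job.stats[name] })
  end

  def respond_to_missing?(name, include_private = false)
    stat?(name) || super
  end

  private

  # True when every job in the set recorded a stat with this name.
  def stat?(name)
    @jobs.any? && @jobs.all? { |job| job.stats.key?(name) }
  end
end

class Repo
  def initialize(jobs)
    @jobs = jobs
  end

  # Select the jobs whose configuration equals criteria[:config].
  def find(criteria)
    JobSet.new(@jobs.select { |job| job.config == criteria[:config] })
  end
end

some_config = { cores: 4 }
repo = Repo.new([
  Job.new({ cores: 4 }, { insns: 100 }),
  Job.new({ cores: 4 }, { insns: 300 }),
  Job.new({ cores: 8 }, { insns: 999 })
])
insns = repo.find(:config => some_config).insns.arith_mean
puts insns
```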
Outline
• Elephants
• Motivation/Goals
• Vault Overview
• Discussion
• NSF
• Status
About Repeatability
• Vault experiments are repeatable because:
  – Experiments are run from versioned source code
  – Inputs are logged
• Vault experiments may not be repeatable if
  – The SCM repository moves/disappears
  – Software update
    • But, can reconstruct the original software
Data Integrity
• Vault behaves like an SCM/DBMS
  – Nothing is written to the repository until commit
• Allows script development without polluting repository
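The commit rule above can be sketched with an in-memory staging area. `StagedRepo` and its methods are hypothetical and only model the behavior described on the slide: writes are buffered, nothing reaches the repository until `commit`, and a script under development can discard its changes with `abort`.

```ruby
# Hypothetical staging area (illustration only, not Vault's implementation).
class StagedRepo
  attr_reader :committed

  def initialize
    @committed = {}
    @staged = {}
  end

  # Buffer a change; the repository itself is untouched.
  def write(name, contents)
    @staged[name] = contents
    self
  end

  # Publish every buffered change in one step.
  def commit
    @committed.merge!(@staged)
    @staged.clear
    self
  end

  # Drop buffered changes, leaving the repository as it was.
  def abort
    @staged.clear
    self
  end
end

repo = StagedRepo.new
repo.write("exp1/config", "cores: 4").abort
puts repo.committed.size
repo.write("exp1/config", "cores: 8").commit
puts repo.committed.inspect
```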
Best Practices
• TBD
  – Storage structure?
  – Experiment naming convention?
  – What to do when something goes wrong? (experiment fails)
Outline
• Elephants
• Motivation/Goals
• Vault Overview
• Discussion
• NSF
• Status
NSF Data Management Plans
• the types of data, samples, physical collections, software, curriculum materials, and other materials to be produced in the course of the project;
  – Vault stat files
• the standards to be used for data and metadata format and content (where existing standards are absent or deemed inadequate, this should be documented along with any proposed solutions or remedies);
  – Vault can conform to *any* standard (stat templates)
NSF Data Management Plans
• policies for access and sharing including provisions for appropriate protection of privacy, confidentiality, security, intellectual property, or other rights or requirements;
  – Filesystem permissions
• policies and provisions for re-use, re-distribution, and the production of derivatives; and
  – Vault’s emphasis on repeatability
NSF Data Management Plans
• plans for archiving data, samples, and other research products, and for preservation of access to them.
  – Vault’s emphasis on repeatability
  – Data is backed up in AFS