Virtual Logbooks and Collaboration in Science and Software Development
Dimitri Bourilkov, Vaibhav Khandelwal, Archis Kulkarni, Sanket Totala
University of Florida
IPAW'06
International Provenance and Annotation Workshop
Chicago, USA, May 3, 2006
D.Bourilkov et al. Virtual Logbooks and Collaboration
Virtual Logbook Motivations
• Data track-ability and result audit-ability: "Virtual Logbook"
• In the nature of science
• Reproducibility of / confidence in results
• Individuals explore other scientists' work and build from it
• Different teams can work in a modular, semi-autonomous fashion: reuse previous data/code/results or entire analysis chains
• Newcomers get a flying start
• Make group knowledge easily accessible 24/7
• Keep project history
Consider a researcher A working on her terminal
A's session includes commands, scripts executed, output produced, environment changes made and other information for achieving some particular goal.
All such sessions of A can be logged easily using CODESH and reproduced later
Consider a collaborator B working remotely
B can extract all of A’s sessions.
B can modify some of A’s sessions and log again and also log some of his own sessions
We could have a totally distributed system comprising many collaborators, making use of either private repositories or shared repositories
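The idea of logging a session and reproducing it later can be sketched roughly as follows. This is only an illustration, not the CODESH implementation: the `SessionLog` class, the `replay` function and the JSON storage format are all hypothetical, and the example assumes a POSIX shell is available.

```python
import json
import os
import subprocess
from dataclasses import dataclass, field, asdict


@dataclass
class SessionLog:
    """Hypothetical record of one shell session (not the CODESH format)."""
    owner: str
    env_changes: dict = field(default_factory=dict)
    steps: list = field(default_factory=list)

    def setenv(self, name, value):
        # Remember environment changes so they can be re-applied on replay.
        self.env_changes[name] = value

    def run(self, command):
        # Execute the command with the recorded environment changes applied
        # and keep the command, its output and its exit status.
        env = {**os.environ, **self.env_changes}
        done = subprocess.run(command, shell=True, capture_output=True,
                              text=True, env=env)
        self.steps.append({"command": command,
                           "output": done.stdout,
                           "returncode": done.returncode})
        return done.stdout

    def to_json(self):
        return json.dumps(asdict(self), indent=2)


def replay(serialized, owner):
    """Re-run a stored session under a new owner and return the fresh log."""
    stored = json.loads(serialized)
    fresh = SessionLog(owner=owner, env_changes=dict(stored["env_changes"]))
    for step in stored["steps"]:
        fresh.run(step["command"])
    return fresh


# Researcher A logs a session; collaborator B reproduces it from the log.
a = SessionLog(owner="A")
a.setenv("GREETING", "hello")
a.run("echo $GREETING")
stored = a.to_json()

b = replay(stored, owner="B")
same = [s["output"] for s in a.steps] == [s["output"] for s in b.steps]
```

Comparing the outputs of the original and the replayed session, as `same` does here, is one simple way to check that a session was reproduced.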
Collaborative Community Tools: CAVES / CODESH Projects
• Concentrate on the interactions between scientists collaborating over extended periods of time
• Seamlessly log, exchange and reproduce results and the corresponding methods, algorithms and programs
• Automatic and complete logging and reuse of work / analysis sessions: collect all user activities on the command line + code of all executed programs
• Extend the power of users working / performing analyses in their habitual way, giving them virtual logbook capabilities
• CAVES is used in normal data analysis sessions with the analysis package ROOT (High Energy Physics)
• CODESH is a UNIX shell with virtual logbook capabilities
• Build functioning collaboration suites - close to users!
• Formed a team: CODESH: DB & Vaibhav Khandelwal; CAVES: DB & Sanket Totala & Archis Kulkarni
CAVES / CODESH Architectures: Scalable and Distributed
First prototypes use popular tools: Python, C++, ROOT and CVS; e.g. all ROOT commands plus CAVES commands, or all UNIX shell commands plus CODESH commands, are available
CAVES / CODESH Architectures – CVS Server Backend
• Work on a per-session basis
• CVS provides version control by tagging releases
• CVS tags act as unique IDs for virtual sessions (the namespace can be structured by a collaborating group, e.g. one big cave or many barrels in a cave, selected on a session basis)
• Both local and remote modes of working
• CVS pservers (secure, efficient remote stores):
• Only CVS user accounts with password authentication, no UNIX accounts on the server (gridmapfile uses same idea)
• Read/write access control lists (per user & directory)
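Because CVS tags serve as session IDs, each ID has to be a legal CVS tag: it must start with a letter and contain only letters, digits, '-' and '_', and the names BASE and HEAD are reserved. A minimal sketch of building and registering such IDs is shown below. The barrel/session/user naming scheme, the `s_` prefix and both function names are assumptions made for illustration; they are not taken from CAVES or CODESH.

```python
import re

# CVS tags start with a letter and contain only letters, digits, '-' and '_'.
# BASE and HEAD are reserved by CVS itself.
_VALID_TAG = re.compile(r"[A-Za-z][A-Za-z0-9_-]*")
_RESERVED = {"BASE", "HEAD"}


def make_tag(barrel, session, user):
    """Build a CVS-safe tag from hypothetical namespace parts."""
    raw = f"{barrel}_{session}_{user}"
    # Replace every character CVS would reject with an underscore.
    tag = re.sub(r"[^A-Za-z0-9_-]", "_", raw)
    if not tag[0].isalpha():
        tag = "s_" + tag  # hypothetical prefix so the tag starts with a letter
    return tag


def register(tag, existing):
    """Add tag to the set of known tags, refusing invalid or duplicate IDs."""
    if not _VALID_TAG.fullmatch(tag) or tag in _RESERVED:
        raise ValueError(f"not a usable CVS tag: {tag!r}")
    if tag in existing:
        raise ValueError(f"tag already used: {tag!r}")
    existing.add(tag)
    return tag


known = set()
first = register(make_tag("higgs", "ww 2006.05", "user1"), known)
second = register(make_tag("2006", "run", "user2"), known)
```

Rejecting duplicates at registration time is what lets a tag act as a unique ID within one repository.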
Annotations
• Log annotations (& possibly summary results or metadata) in the repository /data equivalence!/
• Brief annotations – use cvs -m "Annotation" …
• Optional complete annotations - MySQL servers for metadata (fast search for large data volumes e.g. with indexing)
• The users can browse (a subset of) the existing tags, inspect the annotations, and, if interested, reproduce the results (or just get the how-to), modify them etc.
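A metadata store for complete annotations might look roughly like the sketch below. It uses Python's built-in `sqlite3` as a stand-in for a MySQL server so that the example runs on its own. The table layout, column names, function names and sample annotations are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE annotations ("
    " tag TEXT PRIMARY KEY,"      # the session tag doubles as the unique ID
    " author TEXT NOT NULL,"
    " note TEXT NOT NULL)"
)
# An index on the author column supports lookups by user.
conn.execute("CREATE INDEX idx_author ON annotations (author)")

conn.executemany(
    "INSERT INTO annotations VALUES (?, ?, ?)",
    [
        ("analX_user1", "user1", "first pass of analysis X"),
        ("analX_user2_mod_code", "user2", "analysis X with modified code"),
        ("higgs_ww_cuts", "user1", "selection cuts study"),
    ],
)


def browse(conn, prefix):
    """List the tags that start with the given prefix."""
    cur = conn.execute(
        "SELECT tag FROM annotations WHERE tag LIKE ? ORDER BY tag",
        (prefix + "%",),
    )
    return [row[0] for row in cur]


def inspect_tag(conn, tag):
    """Return (author, note) for one tag, or None if the tag is unknown."""
    return conn.execute(
        "SELECT author, note FROM annotations WHERE tag = ?", (tag,)
    ).fetchone()


def tags_by_author(conn, author):
    """List all tags logged by one author."""
    cur = conn.execute(
        "SELECT tag FROM annotations WHERE author = ? ORDER BY tag", (author,)
    )
    return [row[0] for row in cur]


analx_tags = browse(conn, "analX")
```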
Extensible Command Set
• Session commands: open <session>, close <session>
• During analysis: help <command>, browse <tag>, inspect <tag> <b|c>, startlog, log <tag> <annot>, annotate <tag>, extract <tag>
• Administrative tasks: copy <tag> <from> <to>, move <tag> <from> <to>, delete <tag> <from>, archive <tag> <to>, retrieve <tag> <from>
• CODESH commands: run, shell; getenv, getalias etc.; take/get snapshot
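One common way to keep a command set extensible is a registry that maps command names to handler functions, with unrecognized input passed on to the underlying shell. The sketch below illustrates that pattern only. The registry, the handler signatures and the in-memory `state` dictionary are assumptions, not the actual CAVES or CODESH code.

```python
import shlex

COMMANDS = {}


def command(name):
    """Decorator that registers a handler under a logbook command name."""
    def wrap(func):
        COMMANDS[name] = func
        return func
    return wrap


@command("log")
def do_log(state, tag, annot=""):
    # Store the annotation under the tag (stand-in for a repository commit).
    state["tags"][tag] = annot
    return f"logged {tag}"


@command("browse")
def do_browse(state, prefix=""):
    # List the known tags, optionally restricted to a prefix.
    return " ".join(sorted(t for t in state["tags"] if t.startswith(prefix)))


@command("help")
def do_help(state):
    return " ".join(sorted(COMMANDS))


def dispatch(state, line):
    """Run a logbook command, or mark the line for the underlying shell."""
    parts = shlex.split(line)
    if not parts:
        return ""
    handler = COMMANDS.get(parts[0])
    if handler is None:
        # Not a logbook command: a real client would hand it to the shell.
        return f"pass to shell: {line}"
    return handler(state, *parts[1:])


state = {"tags": {}}
logged = dispatch(state, 'log analX_user1 "first pass"')
listing = dispatch(state, "browse anal")
other = dispatch(state, "ls -l")
```

Adding a new command then only requires one more decorated function; the dispatcher itself stays unchanged.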
Working Releases - CODESH
• Virtual log-book for "shell" sessions
• Parts can be local (private) or shared
• Tracks environment variables, aliases, invoked program code etc. during a session
• Reproduces complete working sessions
• Complex high energy physics examples for the CMS experiment operational
• E.g. all CMS data generations for a Florida group during several months are stored in CODESH and the knowledge is available
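Tracking environment variables during a session can be pictured as taking a snapshot at the start and at the end, then recording the difference. The following is a minimal sketch of that idea with hypothetical function names and sample values; it does not reproduce how CODESH stores its snapshots.

```python
import os


def take_snapshot(environ=None):
    """Copy the environment (os.environ by default) into a plain dict."""
    return dict(os.environ if environ is None else environ)


def diff_snapshots(before, after):
    """Describe what changed between two environment snapshots."""
    added = {k: after[k] for k in after.keys() - before.keys()}
    removed = sorted(before.keys() - after.keys())
    changed = {
        k: (before[k], after[k])
        for k in before.keys() & after.keys()
        if before[k] != after[k]
    }
    return {"added": added, "removed": removed, "changed": changed}


# Example with explicit dictionaries standing in for a real session.
start = take_snapshot({"PATH": "/usr/bin", "ROOTSYS": "/opt/root", "TMP": "/tmp"})
end = take_snapshot({"PATH": "/usr/bin:/opt/root/bin", "ROOTSYS": "/opt/root",
                     "DATASET": "sample_a"})
delta = diff_snapshots(start, end)
```

Storing only the difference keeps the log small, while the full starting snapshot can still be kept when an exact reproduction of the environment is needed.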
Working Releases - CAVES
[Figure: Higgs W+W-]
Working Releases
• The client codes are easy to download and install (stored in CVS) – checkout and build
• All you need to run the clients is ROOT 3.05 or higher and CVS (from v 1_0_0 the client is fully ROOT-compliant); or Python 2.1 or higher
• Users can browse the virtual logbooks and reproduce the examples stored there
• They can play with e.g. new analyses and store them locally or on our server, or install new servers for groups collaborating on a project
• Stress test: CODESH repository with 10 sessions – 1 sec to inspect a session; with 10 000 sessions – 2 sec to inspect a session
Prototype CMS Analysis
Developed a stand-alone C++ code to analyze data stored in tree format:
• Lightweight, no external dependencies besides ROOT, used for iGrid2005 and SC|05 demos and by users at UF for analysis and CMS production validation
• Sessions stored by CAVES in private or group repositories, data stored locally or on remote xrootd servers
• Fully distributed application
Outlook
• Work in progress – first CAVES and CODESH releases out
• We are looking forward to user feedback
• Possible future directions:
• Different back-ends: web/grid service oriented (e.g. Clarens)
• Extend remote data access: e.g. xrootd …
• Add GSI security
• Automatically convert session log to workflow
• Tune on smaller samples, schedule on grid for larger tasks
A picture is better than 1000 words: Try out the releases!
CAVES white paper arXiv: physics/0401007
CODESH / CAVES paper arXiv: physics/0410226
Virtual States and Transitions ICCS 2005, LNCS 3516, pp. 342-345, 2005; © Springer-Verlag
More info at http://cern.ch/bourilkov/caves.html
Possible Scenarios
Case 1: Simple
User 1: Does some analysis and produces a result with tag analX_user1.
User 2: Browses all current tags in the repository and fetches the session stored with tag analX_user1.
Case 2: Complex
User 1: Does some analysis and produces a result with tag analX_user1.
User 2: Browses all current tags in the repository and fetches the session stored with tag analX_user1.
User 2: Does a modification in the program obtained from the session of user1 and stores it along with a new result with tag analX_user2_mod_code.
User 1: Browses the repository, finds that his program was modified and decides to extract that session using the tag analX_user2_mod_code.
This scenario can be extended to include an arbitrary number of steps and users in a working group or groups in a collaboration.
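The complex case can be walked through with a small in-memory model of a shared repository. Only the tags analX_user1 and analX_user2_mod_code come from the scenario. The `Session` and `Repository` classes, the `parent` field used to trace which session a modification came from, and the sample code and result strings are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Session:
    tag: str
    author: str
    code: str
    result: str
    parent: Optional[str] = None  # tag of the session this one was derived from


class Repository:
    """Hypothetical shared store of logged sessions, keyed by tag."""

    def __init__(self) -> None:
        self._sessions: Dict[str, Session] = {}

    def log(self, session: Session) -> None:
        if session.tag in self._sessions:
            raise ValueError(f"tag already exists: {session.tag}")
        self._sessions[session.tag] = session

    def browse(self) -> List[str]:
        return sorted(self._sessions)

    def extract(self, tag: str) -> Session:
        return self._sessions[tag]

    def history(self, tag: str) -> List[str]:
        # Follow parent links back to the original session.
        chain = []
        current: Optional[str] = tag
        while current is not None:
            chain.append(current)
            current = self._sessions[current].parent
        return chain


repo = Repository()

# User 1 logs an analysis.
repo.log(Session("analX_user1", "user1", code="cut = 10", result="42 events"))

# User 2 browses, extracts, modifies the code and logs under a new tag.
fetched = repo.extract(repo.browse()[0])
repo.log(Session("analX_user2_mod_code", "user2",
                 code=fetched.code.replace("10", "20"),
                 result="17 events", parent=fetched.tag))

# User 1 finds the modified session and extracts it.
modified = repo.extract("analX_user2_mod_code")
lineage = repo.history(modified.tag)
```

Each further step or user adds one more tagged session and one more link in the chain returned by `history`.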