TRANSCRIPT
Tigres: Template Interfaces for Agile Parallel Data-Intensive Science
Lavanya Ramakrishnan
http://tigres.lbl.gov
[Figure: Gene2Life workflow, with paired blast → clustalw → dnapars/protpars → drawgram runs producing tree files (.ps and .pdf)]
(CS Biased) View of Workflow Challenges: Gene2Life Molecular Biology Analysis
• Mostly simple sequential workflow
• Repetitive
• Tracking
– Provenance, metadata, etc.
• “Iterative”
– Swap programs and data sets
• Desktop to HPC/Cloud
[Figure: workflow stages for a nucleotide or amino acid input: search → alignment → analysis → visualization]
[Figure: MotifNetwork workflow: Pre-Interproscan → N=135 parallel Interproscan tasks → Post-Interproscan → Motif stages, run on 256 processors]
(CS Biased) View of Workflow Challenges: MotifNetwork
• Mid-sized “compute-intensive” workflow
• Mix of single-processor and multiprocessor tasks
• Intermediate data formatting/logic
• Move and share data
[Figure: workflow stages: data preparation → analysis → aggregation → processing]
People use ad-hoc scripts, keep notes in text files and encode metadata in file names
Source: Jeff Tilson
Big Data is here …
• Larger volumes of data
• More dynamic content
• Significant variety in data types
• Large amounts of unstructured data
• Increased rate of data arrival
• Need for faster data processing rates
… and it is not getting easier.
MapReduce and Hadoop Ecosystem
[Figure: Map → Reduce stages]
• Computation performed on large volumes of data in parallel
• Provides scaling, data locality, fault tolerance
• Higher-level tools have evolved for specific data analysis
• There are challenges in using MapReduce/Hadoop for scientific workflows
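The Map/Reduce idea above can be sketched in plain Python. This is a toy single-process illustration of the pattern (word counting), not Hadoop code; the function names `map_phase` and `reduce_phase` are invented for the example.

```python
from collections import Counter
from functools import reduce

def map_phase(document):
    """Map: emit per-word counts for one document."""
    return Counter(document.split())

def reduce_phase(left, right):
    """Reduce: merge two partial word-count tables."""
    return left + right

documents = ["big data is here", "data is not getting easier"]
partials = [map_phase(d) for d in documents]   # would run in parallel on a cluster
totals = reduce(reduce_phase, partials, Counter())
print(totals["data"])  # 2
```

In a real MapReduce system the map calls run on the nodes holding the data (data locality) and the framework restarts failed tasks, which is exactly the scaling and fault tolerance the slide refers to.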
Tigres: Design templates for common scientific workflow patterns
"LightSrc" Domain templates
Base Tigres templates
Scale up
Application "LightSrc-1"
Application "LightSrc-2"
Create andDebug
Share
Create andDebug
Implement templates as a library in an existing language
Tigres Templates
[Figure: the four base templates: Sequence (Task1 → … → TaskN in order), Parallel (Task1 … TaskN side by side), Split (one task fanning out to Task1 … TaskN), and Merge (Task1 … TaskN fanning in to one task)]
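The Sequence and Parallel patterns in the figure can be illustrated with a minimal sketch. These `sequence` and `parallel` functions are hypothetical stand-ins written for this example; they mirror the semantics shown on the slide, not the real Tigres library calls.

```python
from concurrent.futures import ThreadPoolExecutor

def sequence(tasks, first_input):
    """Sequence template: run tasks one after another, piping each
    task's output into the next task's input."""
    value = first_input
    for task in tasks:
        value = task(value)
    return value

def parallel(tasks, inputs):
    """Parallel template: run each task on its own input concurrently
    and collect the outputs in order."""
    with ThreadPoolExecutor() as pool:
        return list(pool.map(lambda pair: pair[0](pair[1]), zip(tasks, inputs)))

double = lambda x: 2 * x
inc = lambda x: x + 1
sequence([double, inc], 3)        # (3*2)+1 = 7
parallel([double, inc], [3, 3])   # [6, 4]
```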
Key Aspects of Tigres
• Targeted for large-scale data-intensive workflows
– Motivated by the “MapReduce” model
– No centralized management model
• Library model embedded in existing languages such as Python and C
– “Extend current scripting/programming tools”
– API-based, embedded in code
• Light-weight execution framework
– “As easy to run as an MPI program on an HPC resource”
– No persistent services
• Scientist-centered design process
– Get feedback from users before writing all the code
[Figure: Tigres development cycle: Design → API → Implementation → Optimizations → Execution Environment, driven by the Scientist-Centered Design Process]
Tigres Design Process
Create a workflow:
1. Define input types
2. Define a task
3. Assign input values
4. Repeat 1–3 for other tasks
5. Create the appropriate input arrays
6. Create the appropriate task arrays
7. Create (and run) the template
[Figure: example grid of tasks (Task1 … Task55) with typed inputs, e.g., input1_task1 {Type: Object_a}, input2_task1 {Type: int}]
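The seven steps above can be sketched end to end. Everything here (the `Task` tuple, the list-based "arrays", the inline runner) is a hypothetical mock built for illustration, not the real Tigres objects.

```python
from collections import namedtuple

# Hypothetical stand-in for a Tigres task: a name plus an implementation.
Task = namedtuple("Task", ["name", "func"])

# Steps 1-3: define input types (int, str), a task, and its input values;
# step 4: repeat for a second task.
input_values = [(1, "a"), (2, "b")]               # step 5: input array
tasks = [Task("t1", lambda n, s: s * n),           # step 6: task array
         Task("t2", lambda n, s: s * n)]

# Step 7: create and "run" the template, pairing each task with its input.
outputs = [t.func(*args) for t, args in zip(tasks, input_values)]
print(outputs)  # ['a', 'bb']
```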
Templates
• Sequence ( name, task_array, input_array )
– e.g., output[ ] = Sequence(“my seq”, task_array_12, input_array_12)
• Parallel ( name, task_array, input_array )
– e.g., output[ ] = Parallel(“abc”, task_array_12, input_array_12)
• Split ( name, split_task, split_input_values, task_array, task_array_in )
– e.g., Split(“split”, task_x1, input_value_1, spl_t_arr, spl_i_arr)
• Merge ( name, task_array, input_array, merge_task, merge_input_values )
– e.g., Merge(“merge”, syn_t_arr, syn_i_arr, task_x1, input_value_1)
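The Split and Merge templates listed above can also be sketched. These `split` and `merge` functions are simplified hypothetical versions written for this example; their signatures are reduced from the slide's and are not the real Tigres API.

```python
from concurrent.futures import ThreadPoolExecutor

def split(split_task, split_input, tasks):
    """Split template: one task fans out, its output becoming the
    input of many parallel tasks."""
    seed = split_task(split_input)
    with ThreadPoolExecutor() as pool:
        return list(pool.map(lambda t: t(seed), tasks))

def merge(tasks, inputs, merge_task):
    """Merge template: many parallel tasks fan in, their outputs
    feeding a single merge task."""
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(lambda pair: pair[0](pair[1]),
                                 zip(tasks, inputs)))
    return merge_task(partials)

fanned = split(lambda x: x + 1, 1, [lambda s: s * 2, lambda s: s * 3])  # [4, 6]
total = merge([lambda x: x * 2, lambda x: x * 3], [4, 6], sum)          # 8 + 18 = 26
```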
Scientist-Centered Design Process
• Used Google Docs for an interactive step-by-step exercise with a “facilitator” and a “human compiler”
– A white/black board didn’t work
• Preparation: terminology, basic template, example, exercise
– 15 minutes of preparation time
• Testers (~6): developers, web design/UI staff, application scientists
• Concept understanding by users
• Changes to nomenclature
• Support in C also important
• Priorities for the first prototype: desktop to NERSC, monitoring, intermediate state management
Impact of Scientist-Centered Design
What did we learn?
• Documentation clarity was key
– The majority of our participants “coded by example”
• Nomenclature was important
– Confusion with the initial terminology
• Keep the API simple
– Dependencies/outputs: two different styles within and outside a template can be confusing
• Support an extended API
– Optional parameters, different programming styles
• Execution semantics were important
– Monitoring, logging
It took days rather than weeks (or months) for the first stub implementation!
Summary
• “Scientist-friendly” programming API to manage workflows
• Plan to test API with different user groups
• Core team – Deb Agarwal (PI), Lavanya Ramakrishnan, Daniel Gunter – Gilberto Pastorello, Valerie Hendrix, Ryan Rodriguez
• CS Research groups – Shane Canon – John Shalf
• Science research groups – Cosmology - Alex Kim, Rollin Thomas, Stephen Bailey – Gamma Ray - Dan Chivers – Advanced Light Source - Dula Parkinson – HEP - Paolo Calafiura – Materials – Kristin Persson
Tigres Team
Website: http://tigres.lbl.gov Contact: [email protected]
Monitoring
• Initialize
– init(tigres-destination, user-destination)
• User logging
– setLevel(level): enumeration from FATAL up to TRACE
– write(level, name, key-value pairs)
• Query
– getStatus(type, names)
– getInfo(name, key-value pairs)
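A minimal sketch of how the monitoring calls above could fit together. The `Monitor` class, its method names, and the level-filtering behavior are all assumptions made for this illustration; only the call names and the FATAL-to-TRACE enumeration come from the slide.

```python
import time

# Level ordering mirrors the slide's FATAL-up-to-TRACE enumeration.
LEVELS = ["FATAL", "ERROR", "WARN", "INFO", "DEBUG", "TRACE"]

class Monitor:
    """Hypothetical mock of init / setLevel / write / getStatus."""

    def __init__(self, destination):
        self.destination = destination   # init(tigres-destination, ...)
        self.threshold = "INFO"
        self.records = []

    def set_level(self, level):
        self.threshold = level

    def write(self, level, name, **keyvals):
        # Record only messages at or above the current threshold.
        if LEVELS.index(level) <= LEVELS.index(self.threshold):
            self.records.append({"ts": time.time(), "level": level,
                                 "name": name, **keyvals})

    def get_status(self, name):
        # Most recent record for the name wins.
        for rec in reversed(self.records):
            if rec["name"] == name:
                return rec["level"]
        return "UNKNOWN"

mon = Monitor("run.log")
mon.write("INFO", "task1", state="DONE")
mon.write("TRACE", "task1", state="detail")   # filtered: below INFO threshold
mon.get_status("task1")  # "INFO"
```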
Input and Task
• InputTypes ( name, types[ ] ) – e.g., input_type1 = InputTypes(“Types1”, {“int”, “string”})
• InputValues ( name, values[ ] ) – e.g., input_value1 = InputValues(“Values1”, {1, “hello”})
• InputArray ( name, input_values[ ]) – e.g., input_array_12 = InputArray(“Array12”, {input_value_1, input_value_2})
• Task ( name, type, impl_name, input_types, env ) – e.g., task_f1 = Task(“A”, FUNCTION, “myfunc”, input_type1)
• TaskArray ( name, task[ ] ) – e.g., task_array_xy = TaskArray(“xy”, {task_f1, task_x1})
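Putting the constructors above together, a minimal mock shows how input and task arrays feed a template. These thin dictionary-based versions of `InputValues`, `InputArray`, `Task`, `TaskArray`, and `Parallel` are invented for this sketch; the real Tigres objects are richer than this.

```python
# Hypothetical mocks so the calls read like the slide's examples.

def InputValues(name, values):
    return {"name": name, "values": values}

def InputArray(name, input_values):
    return {"name": name, "items": input_values}

def Task(name, impl):
    return {"name": name, "impl": impl}

def TaskArray(name, tasks):
    return {"name": name, "tasks": tasks}

def Parallel(name, task_array, input_array):
    # Pair each task with its input values and run them all.
    return [t["impl"](*iv["values"])
            for t, iv in zip(task_array["tasks"], input_array["items"])]

iv1 = InputValues("Values1", [1, "hello"])
iv2 = InputValues("Values2", [3, "hi"])
task = Task("A", lambda n, s: s * n)
out = Parallel("abc",
               TaskArray("xy", [task, task]),
               InputArray("Array12", [iv1, iv2]))
print(out)  # ['hello', 'hihihi']
```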