Post on 20-Dec-2015
Application-Aware Management of Parallel Simulation Collections
Siu-Man Yau, New York University
Steven G. Parker, University of Utah
Kostadin Damevski, University of Utah
Vijay Karamcheti, New York University
Denis Zorin, New York University
Multi-Experiment Studies
• Existing (batch-based) systems treat each execution as a ‘black box’:
  – Issue one simulation at a time
• Application-aware system:
  – Schedule collection of simulations as a whole
  – Use application-specific knowledge for scheduling and resource allocation decisions
• Application-awareness brings 4X improvement in response time
Outline
• Example MES: Helium Model Validation
• Evaluation platform: SimX System
• Application-specific considerations
  – Parallel overhead, Sampling, Result reuse, Malleability
• Application-Driven Scheduling and Resource Allocation Strategies
• Conclusion
Helium Model Validation
• Gas mixing model for fire simulation
• “Knobs” on model:
  – Prandtl number
  – Smagorinsky constant
  – Grid resolution
  – Inlet velocity
  – etc.
• To validate: compare against a real-life experiment
Helium Model Validation
• Measure velocity profile from real-life experiment
• Pick two “knobs”
  – Prandtl number
  – Inlet velocity
• Run simulated experiments
• Find the combination that matches the profile at both heights
Evaluation platform: SimX
• System support for Interactive Multi-Experiment Studies (SIMECS)
• View computational study as a whole
• For parallel, distributed clusters
  – Workers (Simulation code & Evaluation code)
  – Manager (UI, Sampler, Resource Allocator)
  – Spatially-Indexed Shared Object Layer (SISOL)
Evaluation platform: SimX
[Figure: SimX architecture. Front-end Manager Process (User Interface: Visualisation & Interaction, Sampler, Resource Allocator, FUEL Interface); Worker Process Pool (Task Queue, Simulation code, FUEL Interface, Evaluation code); SISOL Server Pool (Data Servers, Dir Server), reached through the SISOL API.]
Application-Awareness
• Decision: How many processes for each task?
• Application-specific considerations
  – Minimize parallelization overhead: favors concurrent tasks, low parallelism
  – Sampling strategy (task dependency): favors serial tasks, high parallelism
  – Reuse opportunities (maximize “reusable” work): favors serial tasks, high parallelism
  – Malleability: claim idle resources when beneficial
• These considerations work against each other
Application-awareness
• Naïve approach: Assign one worker per task
  – Eliminates per-task parallelization overhead
  – Does not maximize reuse and sampling efficiency
  – Leaves “holes” (idle workers)
• Naïve approach: Assign one task at a time to all workers
  – Maximizes reuse potential and sampling efficiency
  – Maximizes parallelization overhead
• Application-aware approach: Batching
  – Groups of tasks allowed to be executed concurrently
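The three approaches differ only in how many tasks run at once, so they can be compared with one small allocation routine. The sketch below is an illustration, not SimX code: `AllocateWorkers` is a hypothetical helper that splits a worker pool as evenly as possible among the tasks of one batch.

```cpp
#include <vector>

// Hypothetical helper (not part of SimX): split `workers` processes as
// evenly as possible among `tasks` tasks that run concurrently in a batch.
// Returns the group size assigned to each running task.
std::vector<int> AllocateWorkers(int workers, int tasks) {
    std::vector<int> groups;
    if (workers <= 0 || tasks <= 0) return groups;
    // With more tasks than workers, only `workers` tasks can run at once.
    const int running = tasks < workers ? tasks : workers;
    const int base = workers / running;   // minimum group size
    const int extra = workers % running;  // first `extra` groups get one more
    for (int i = 0; i < running; ++i) {
        groups.push_back(base + (i < extra ? 1 : 0));
    }
    return groups;
}
```

With 8 workers, a batch of 8 tasks reproduces the first naïve approach (groups of 1), a batch of 1 task reproduces the second (one group of 8), and a batch of 3 tasks gives groups of 3, 3, and 2.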
Solution: Application-awareness
[Figure: SimX architecture diagram (Front-end Manager Process, Worker Process Pool with Task Queue and SimulationContainer, SISOL Server Pool), annotated with the calls below.]
TaskQueue::AddTask(Experiment)
TaskQueue::CreateBatch(set<Experiment>&)
TaskQueue::GetIdealGroupSize()
Reconfigure(const int* assignment)
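The slide names the task-queue calls but gives no return types or semantics. The following is a minimal sketch of how such a queue might behave. The `Experiment` fields, the constructor, and the rule used by `GetIdealGroupSize` (workers divided evenly over the batch at the head of the queue) are all assumptions made for illustration.

```cpp
#include <cstddef>
#include <set>
#include <tuple>
#include <vector>

// Hypothetical experiment record: the two "knobs" varied in the study.
struct Experiment {
    double prandtl;
    double inlet_velocity;
    // Ordering so that experiments can be stored in a std::set.
    bool operator<(const Experiment& o) const {
        return std::tie(prandtl, inlet_velocity) <
               std::tie(o.prandtl, o.inlet_velocity);
    }
};

// Sketch of a batching task queue; method names follow the slide,
// everything else is assumed.
class TaskQueue {
public:
    explicit TaskQueue(std::size_t workers) : workers_(workers) {}

    // Queue a single experiment as a batch of one.
    void AddTask(Experiment e) {
        batches_.push_back(std::set<Experiment>{e});
    }

    // Queue a group of experiments that may execute concurrently.
    void CreateBatch(std::set<Experiment>& batch) {
        if (!batch.empty()) batches_.push_back(batch);
    }

    // Assumed rule: workers per task for the batch at the head of the queue.
    std::size_t GetIdealGroupSize() const {
        if (batches_.empty()) return 0;
        const std::size_t n = batches_.front().size();
        return n >= workers_ ? 1 : workers_ / n;
    }

private:
    std::size_t workers_;
    std::vector<std::set<Experiment>> batches_;
};
```

Under these assumptions, a queue built for 8 workers whose head batch holds two experiments reports a group size of 4.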
Batch for Sampling
• Identify independent experiments in sampler
• Max. parallelism while allowing active sampling
[Figure: four panels in the Prandtl Number vs. Inlet Velocity plane: First Batch (1st Pareto-optimal), Second Batch (1st & 2nd Pareto-optimal), 3rd Batch (1st to 3rd Pareto-optimal), 4th Batch (Pareto frontier).]
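The sampling batches are described in terms of Pareto-optimal experiments. As a sketch of that idea, the code below filters a set of results down to its non-dominated members. It assumes each experiment is scored by two errors to be minimized (for example, the mismatch against the measured velocity profile at each of the two heights); the `Result` type and both function names are hypothetical.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical score for one experiment: mismatch against the measured
// profile at the two heights (smaller is better for both).
struct Result {
    double err_low;
    double err_high;
};

// True if `a` is at least as good as `b` in both errors and strictly
// better in at least one.
bool Dominates(const Result& a, const Result& b) {
    return a.err_low <= b.err_low && a.err_high <= b.err_high &&
           (a.err_low < b.err_low || a.err_high < b.err_high);
}

// Return the results that no other result dominates (the Pareto front).
std::vector<Result> ParetoFront(const std::vector<Result>& results) {
    std::vector<Result> front;
    for (std::size_t i = 0; i < results.size(); ++i) {
        bool dominated = false;
        for (std::size_t j = 0; j < results.size() && !dominated; ++j) {
            dominated = (i != j) && Dominates(results[j], results[i]);
        }
        if (!dominated) front.push_back(results[i]);
    }
    return front;
}
```

For the scores (1, 4), (2, 2), (4, 1), and (3, 3), the last point is dominated by (2, 2), so the front keeps the other three.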
Batch for Result Reuse
• Sub-divide each batch into 2 smaller batches:
  – 1st sub-batch: the first experiment in each reuse class; no two belong to the same reuse class
  – No two concurrent from-scratch experiments can reuse each other’s checkpoints (max. reuse potential)
  – Experiments in the same batch have comparable run times (reduces holes)
[Figure: plot with axes Prandtl Number and Inlet Velocity.]
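The sub-division rule can be sketched as a single pass over a batch. In this illustration the reuse class is a plain integer label attached to each task. How SimX actually derives reuse classes is not stated on the slide, so `Task` and `SplitByReuseClass` are hypothetical.

```cpp
#include <set>
#include <utility>
#include <vector>

// Hypothetical task record: an id plus the label of its reuse class.
struct Task {
    int id;
    int reuse_class;
};

// Split a batch into two sub-batches. The first task seen in each reuse
// class goes to the first sub-batch (these run from scratch); all other
// tasks go to the second sub-batch, where they can reuse checkpoints.
std::pair<std::vector<Task>, std::vector<Task>>
SplitByReuseClass(const std::vector<Task>& batch) {
    std::set<int> seen;
    std::vector<Task> first;
    std::vector<Task> second;
    for (const Task& t : batch) {
        // insert().second is true only for a class not seen before.
        if (seen.insert(t.reuse_class).second) {
            first.push_back(t);
        } else {
            second.push_back(t);
        }
    }
    return std::make_pair(first, second);
}
```

By construction no two tasks in the first sub-batch share a reuse class, which is the property the first bullet asks for.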
Batch for Result Reuse
• Total time: 5 hr 10 mins
[Figure: panels labeled 1st Batch through 6th Batch.]
Preemption
• Helium code is malleable:
  – Restart a checkpointed run on a different number of workers
• Preemption system:
  – Manager stores a database of idle workers in SISOL
  – Workers use application knowledge to determine whether they should claim idle workers
  – Manager creates a new worker group by adding idle workers to the group
  – Manager restarts the simulation on the new group
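The slides do not say what application knowledge the workers use to decide whether claiming idle workers pays off. One plausible form of that decision, shown purely as an assumption, compares the remaining run time on the current group with an estimate for the enlarged group plus the cost of the checkpoint restart. `RunState`, the efficiency factor, and the restart cost are all hypothetical.

```cpp
// Hypothetical snapshot of a running simulation.
struct RunState {
    double remaining_min;  // estimated minutes left on the current group
    int workers;           // current group size
};

// Assumed cost model: remaining work scales with group size, degraded by
// a parallel efficiency factor in (0, 1], plus a fixed restart cost.
double EstimateAfterClaim(const RunState& s, int idle,
                          double efficiency, double restart_min) {
    const double scaled = s.remaining_min * s.workers / (s.workers + idle);
    return scaled / efficiency + restart_min;
}

// Claim the idle workers only if the estimate beats staying put.
bool ShouldClaim(const RunState& s, int idle,
                 double efficiency, double restart_min) {
    if (idle <= 0 || s.workers <= 0 || efficiency <= 0.0) return false;
    return EstimateAfterClaim(s, idle, efficiency, restart_min) <
           s.remaining_min;
}
```

Under this model a run with 60 minutes left on 4 workers gains from claiming 4 more (about 42.5 minutes at 80% efficiency and a 5-minute restart), while a run with only 6 minutes left does not.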
Evaluation: Resource Allocation

Knowledge used             Total time     Utilization rate   Avg. time per run   Improvement
None (run on 1 worker)     12 hr 35 min   56.3%              6 hr 17 min         N/A
None (run 1 experiment)    20 hr 35 min   100%               34.3 min            N/A
+ Active Sampling          6 hr 10 min    71.1%              63.4 min            51% / 70%
+ Reuse classes            5 hr 10 min    71.3%              39.7 min            59% / 75%
+ Preemption               4 hr 30 min    91.8%              34.5 min            64% / 78%
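The paired Improvement figures are consistent with the reduction in total time relative to each of the two baselines (12 hr 35 min and 20 hr 35 min, in that order). The sketch below reproduces that arithmetic; the helper names are illustrative and the interpretation is inferred from the numbers rather than stated on the slide.

```cpp
#include <cmath>

// Convert hours and minutes to minutes.
int Minutes(int hours, int minutes) { return 60 * hours + minutes; }

// Percentage reduction in total time relative to a baseline,
// rounded to the nearest whole percent.
int ReductionPercent(int baseline_min, int total_min) {
    return static_cast<int>(
        std::lround(100.0 * (baseline_min - total_min) / baseline_min));
}

// Ratio of baseline time to improved time.
double Speedup(int baseline_min, int total_min) {
    return static_cast<double>(baseline_min) / total_min;
}
```

For the final row, 4 hr 30 min against 12 hr 35 min gives 64% and against 20 hr 35 min gives 78%. The ratio 1235 / 270 is about 4.6, which is in line with the 4X response-time claim.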
Conclusion
• Application-awareness yields up to 4+ times improvement in response time
• Conclusions:
  – View from application level important
  – Domain knowledge important
  – System API and infrastructure to exploit domain knowledge important
    • Task Queue API for batching
    • SISOL & Resource Allocator API for pre-emption