Virtual mpirun: Jason Hale, Engineering 692 Project Presentation, Fall 2007
Rationale
Compute cycles = money
Mimosa (250 nodes): $0.06 per CPU hour
Wasted CPU Cycles -> Wasted Money
Wasted User Time -> Less Research
Not all parallel computations run efficiently
Goal of a Supercomputing Center: Have users run on the maximum number of CPUs/nodes they can utilize efficiently
Percentage of MCSR Jobs Using g03sub (Gaussian)
# g03sub PBS jobs, 26,691, 88%
# other jobs, 3,724, 12%
MCSR Initiatives to Improve Utilization
g03sub: Enhanced (virtualized?) wrapper for users submitting Gaussian calculations
Back-end Processes to poll PBS batch scheduler to compute utilization of parallel jobs; post to DB & Web; e-mail inefficient Users
Amber Alert System
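The utilization figure that the back-end processes compute can be illustrated with a small calculation. This is a minimal sketch, assuming parallel efficiency means total CPU time divided by (wall-clock time × CPUs requested); the slides do not state the formula. The function names and the 0.5 warning threshold are hypothetical. The $0.06 per CPU hour rate is the Mimosa figure quoted above.

```cpp
#include <stdexcept>

// Assumed definition (not stated in the slides):
// efficiency = CPU seconds actually used / (wall-clock seconds * CPUs requested).
double parallelEfficiency(double cpuSeconds, double wallSeconds, int cpusRequested) {
    if (wallSeconds <= 0.0 || cpusRequested <= 0) {
        throw std::invalid_argument("wall time and CPU count must be positive");
    }
    return cpuSeconds / (wallSeconds * cpusRequested);
}

// Cost of the CPU hours that were reserved but not used, at a given rate per CPU hour.
double wastedCost(double cpuSeconds, double wallSeconds, int cpusRequested,
                  double ratePerCpuHour) {
    double reservedHours = wallSeconds * cpusRequested / 3600.0;
    double usedHours = cpuSeconds / 3600.0;
    double idleHours = reservedHours - usedHours;
    return idleHours > 0.0 ? idleHours * ratePerCpuHour : 0.0;
}

// Hypothetical policy: flag jobs below a chosen efficiency threshold for a warning e-mail.
bool shouldWarnUser(double efficiency, double threshold = 0.5) {
    return efficiency < threshold;
}
```

For example, a job that reserves 4 CPUs for one hour but accumulates only 2 CPU hours has an efficiency of 0.5 under this definition and leaves 2 CPU hours, or $0.12 at the Mimosa rate, unused.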
[Chart: Average Efficiency of Parallel G03 Calculations on Scalar MCSR Systems (Redwood, Sweetgum); y-axis 56% to 61%, x-axis 2005, 2006, 2007]
These Systems Don’t Work for Mimosa Cluster
PBSPro can’t accumulate CPU usage times from parallel processes distributed across compute nodes
Idea: Create a monitor process that will follow parallel processes to nodes, monitor their CPU performance, and report back.
Virtualization: Users will not know about the process. They will launch a virtual mpirun (or g03sub), not realizing that it is not the “real” one, and it will launch the real one along with the monitor.
Running an MPI Program on a Cluster
Source: myprogram.c
Compile: cc myprogram.c -o myprogram.exe
Batch script (myscript.pbs):
#PBS -l nodes=4
mpirun -np 4 myprogram.exe
Submit: qsub myscript.pbs
Virtual mpirun
mpirun -np 4 monitor.exe &
mpirun -np 4 myprogram.exe
[Diagram: myscript.pbs on the Head Node; monitor.exe and myprogram.exe processes on the Compute Nodes]
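The two command lines the virtual mpirun issues can be derived from the user's original arguments. The sketch below only builds the command strings; it is an illustration, not the project's wrapper. The names `buildWrapperCommands`, `processCount`, and `mpirun.real` (standing in for the renamed real mpirun) are hypothetical, and no shell quoting is attempted.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Joins argv-style arguments into one command line.
// No shell quoting is done; a real wrapper would need to escape arguments.
std::string joinArgs(const std::vector<std::string>& args) {
    std::string out;
    for (std::size_t i = 0; i < args.size(); ++i) {
        if (i > 0) out += ' ';
        out += args[i];
    }
    return out;
}

// Returns the value following "-np" so the monitor can use the same process count.
// Falls back to "1" when no -np option is present (an assumption of this sketch).
std::string processCount(const std::vector<std::string>& userArgs) {
    for (std::size_t i = 0; i + 1 < userArgs.size(); ++i) {
        if (userArgs[i] == "-np") return userArgs[i + 1];
    }
    return "1";
}

struct WrapperCommands {
    std::string monitorCommand;  // started in the background
    std::string userCommand;     // the user's original mpirun request
};

WrapperCommands buildWrapperCommands(const std::vector<std::string>& userArgs,
                                     const std::string& realMpirun,
                                     const std::string& monitorExe) {
    WrapperCommands cmds;
    cmds.monitorCommand =
        realMpirun + " -np " + processCount(userArgs) + " " + monitorExe + " &";
    cmds.userCommand = realMpirun + " " + joinArgs(userArgs);
    return cmds;
}
```

Called with the arguments `-np 4 myprogram.exe`, this yields the same pair of commands shown on the slide, with the real mpirun's name substituted.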
Design Goals
Collect CPU utilization stats on cluster calculations
No changes to user end processes
No significant performance degradation
No side effects (Leave No Trash Behind)
Monitor even non-MPI parallel codes (Gaussian 03)
Generality and robustness for reuse potential
Components
monitor
• New C++ MPI program
mpirun
• New wrapper around existing mpirun
• Calls existing monitor and “real” mpirun
g03sub
• Existing batch script to launch Gaussian jobs on cluster
• MCSR’s version previously “virtualized”
• Modify to now call monitor program also
[Diagram: four monitor.exe/myprogram.exe pairs; one monitor.exe is the Manager Process and the others are Worker Processes]
Worker Process Logic
While (NoTerminationMessageFromMaster)
    Sleep
    Wakeup
    Create Process Times File
    Read Process Times File
    Delete Process Times File
    If (ActiveProcesses)
        Update Process Times Data Structure
        SendCPUTimeMessageToMaster
    Else
        SendIdleMessageToMaster
    End If/Else
End While
Terminate
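The worker loop can be sketched in C++ with the sleep, sampling, and messaging steps passed in as callbacks, so the control flow can be exercised without MPI. All names here (`WorkerHooks`, `Sample`, `runWorker`) are hypothetical; in the real monitor the hooks would presumably wrap MPI calls and the ps sampling step.

```cpp
#include <functional>

// One sampling result: how many of the user's processes are active on this node
// and how much CPU time they have accumulated so far.
struct Sample {
    int activeProcesses;
    long cpuSeconds;
};

// Hypothetical hooks standing in for the sleep, the ps sampling step, and the MPI calls.
struct WorkerHooks {
    std::function<bool()> terminationRequested;  // true once the manager says stop
    std::function<void()> sleepUntilNextSample;  // Sleep / Wakeup
    std::function<Sample()> sampleProcessTimes;  // create, read, delete the process times file
    std::function<void(long)> sendCpuTime;       // SendCPUTimeMessageToMaster
    std::function<void()> sendIdle;              // SendIdleMessageToMaster
};

// Runs the worker loop and returns the last CPU total it recorded.
long runWorker(const WorkerHooks& hooks) {
    long lastCpuSeconds = 0;  // the "process times data structure", reduced to one number
    while (!hooks.terminationRequested()) {
        hooks.sleepUntilNextSample();
        Sample s = hooks.sampleProcessTimes();
        if (s.activeProcesses > 0) {
            lastCpuSeconds = s.cpuSeconds;
            hooks.sendCpuTime(lastCpuSeconds);
        } else {
            hooks.sendIdle();
        }
    }
    return lastCpuSeconds;
}
```

Passing the steps in as callbacks keeps the loop independent of how messages are delivered, which also makes it easy to test with fake hooks.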
[Diagram: each worker node has its own process times file, /tmp/ps_file]
Example process times file on one worker:
pid cputime
123 06s
124 12s
130 29s
= 47s total
[Diagram: workers with active processes send their CPU time totals (47, 9) to the manager; a worker with no active processes sends Idle]
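Reading the process times file and totaling CPU time could look like the sketch below. It assumes the first two columns of each line are pid and cputime, and that cputime uses the `[DD-]HH:MM:SS` form that procps `ps` prints (the slide abbreviates this as "06s"). The names `cpuTimeToSeconds`, `sumProcessTimes`, and `NodeSample` are hypothetical.

```cpp
#include <cstdio>
#include <istream>
#include <sstream>
#include <string>

// Converts a ps cputime field, formatted [DD-]HH:MM:SS, to seconds.
// Returns -1 if the text does not match that format.
long cpuTimeToSeconds(const std::string& field) {
    int days = 0, hours = 0, minutes = 0, seconds = 0;
    char extra = 0;  // catches trailing characters after the seconds
    if (field.find('-') != std::string::npos) {
        if (std::sscanf(field.c_str(), "%d-%d:%d:%d%c",
                        &days, &hours, &minutes, &seconds, &extra) != 4) return -1;
    } else {
        if (std::sscanf(field.c_str(), "%d:%d:%d%c",
                        &hours, &minutes, &seconds, &extra) != 3) return -1;
    }
    return ((days * 24L + hours) * 60L + minutes) * 60L + seconds;
}

struct NodeSample {
    int processCount = 0;
    long totalCpuSeconds = 0;
};

// Reads lines whose first two whitespace-separated columns are pid and cputime,
// ignoring any remaining columns, and totals the CPU time.
NodeSample sumProcessTimes(std::istream& in) {
    NodeSample result;
    std::string line;
    while (std::getline(in, line)) {
        std::istringstream columns(line);
        long pid = 0;
        std::string cputime;
        if (!(columns >> pid >> cputime)) continue;  // skip blank or malformed lines
        long secs = cpuTimeToSeconds(cputime);
        if (secs < 0) continue;                      // skip unparseable CPU times
        ++result.processCount;
        result.totalCpuSeconds += secs;
    }
    return result;
}
```

With the three processes from the example file (6 s, 12 s, and 29 s), `sumProcessTimes` reports 3 processes and 47 seconds, the value the worker would send to the manager.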
Manager Process Logic
Manager Process
While (Active Processes)
    MONITOR_LOCAL_PROCESSES
    If (LocalActiveProcesses)
        UpdateGlobalCPUTimeStructure
        UpdateActiveProcessesStructure
    Else
        UpdateActiveProcessStructure
    EndIf
    ForEachSlave
        WaitForMessage
        If (CPUMessage)
            UpdateGlobalCPUTimeStructure
            UpdateActiveProcessStructure
        Else If (IdleMessage)
            UpdateActiveProcessStructure
        End If
    End For
EndWhile

Global CPU time structure:
WKR cputime
0   25s
1   35s
2   09s
3   47s
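The manager's two structures (global CPU times and active flags) can be sketched as a small class. This is an illustration under assumptions, not the project's code: the message layout and the names `WorkerMessage`, `MessageKind`, and `GlobalCpuTable` are hypothetical, and receiving messages over MPI is left out.

```cpp
#include <cstddef>
#include <vector>

enum class MessageKind { CpuTime, Idle };

struct WorkerMessage {
    int worker;        // rank of the sender
    MessageKind kind;
    long cpuSeconds;   // meaningful only for CpuTime messages
};

class GlobalCpuTable {
public:
    explicit GlobalCpuTable(std::size_t workers)
        : cpuSeconds_(workers, 0), active_(workers, true) {}

    // A CPU message updates both structures; an idle message only marks
    // the worker inactive and keeps its last reported CPU time.
    void apply(const WorkerMessage& m) {
        std::size_t w = static_cast<std::size_t>(m.worker);
        if (w >= cpuSeconds_.size()) return;  // ignore unknown ranks
        if (m.kind == MessageKind::CpuTime) {
            cpuSeconds_[w] = m.cpuSeconds;
            active_[w] = true;
        } else {
            active_[w] = false;
        }
    }

    // The manager loop continues while this returns true.
    bool anyActive() const {
        for (bool a : active_) {
            if (a) return true;
        }
        return false;
    }

    long totalCpuSeconds() const {
        long total = 0;
        for (long s : cpuSeconds_) total += s;
        return total;
    }

private:
    std::vector<long> cpuSeconds_;
    std::vector<bool> active_;
};
```

Feeding in the four values from the table (25 s, 35 s, 9 s, 47 s) gives a total of 116 CPU seconds, and the table stops reporting active work once every worker has sent an idle message.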
Test MPI Script
Parallel Ultimate Virtual Collapse Program
• Reads a list of integers from a file
• Distributes the integers to all available worker nodes
• Each worker computes the ultimate collapse of its numbers
Control the length of processing time by:
• Number of numbers in the list (1,000,000)
• The size of the numbers in the list (1 to 7 digits)
Control the parallel efficiency by:
• The order of the numbers in the list
• Larger numbers grouped together: fewer nodes do most of the work
• Large numbers evenly distributed: nodes do about the same work
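The effect of list order on parallel efficiency can be illustrated with a toy model. The sketch below assumes the list is split into contiguous blocks, one per worker, and that the work for one integer is proportional to its digit count; both are assumptions made for illustration, since the slides do not describe the distribution scheme or the cost of a collapse. All function names are hypothetical.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical cost model: work for one non-negative integer is its digit count.
long workFor(long value) {
    return static_cast<long>(std::to_string(value).size());
}

// Splits the list into contiguous blocks, one per worker,
// and returns each worker's total work.
std::vector<long> blockWork(const std::vector<long>& numbers, std::size_t workers) {
    std::vector<long> work(workers, 0);
    if (workers == 0) return work;
    std::size_t chunk = (numbers.size() + workers - 1) / workers;  // ceiling division
    for (std::size_t i = 0; i < numbers.size(); ++i) {
        work[i / chunk] += workFor(numbers[i]);
    }
    return work;
}

// Load-balance efficiency: mean work per worker divided by the maximum
// (1.0 means every worker has the same amount of work).
double balanceEfficiency(const std::vector<long>& work) {
    if (work.empty()) return 0.0;
    long total = 0;
    long maxWork = 0;
    for (long w : work) {
        total += w;
        if (w > maxWork) maxWork = w;
    }
    if (maxWork == 0) return 0.0;
    return static_cast<double>(total) /
           (static_cast<double>(work.size()) * static_cast<double>(maxWork));
}
```

Under this model, a list of four 7-digit and four 1-digit numbers split across four workers scores about 0.57 when the large numbers are grouped at the front and 1.0 when large and small numbers alternate.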
Project Status
Test program: written (Ultimate Collapse)
Monitor program: partially complete; some work remains
Worker steps:
• Sleep/Wakeup
• Create Process Times File
• Read Process Times File
• Delete Process Times File
• If (ActiveProcesses) Update Process Times Data Structure, SendCPUTimeMessageToMaster
• Else SendIdleMessageToMaster
• Terminate
ps syntax from monitor.cpp
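// List this user's processes (pid, CPU time, elapsed time, command, user, c, %CPU),
// filter the lines with grep, and append the result to the process times file.
string psCommand (" ps -u " + username + " --no-headers -o pid,cputime,etime,comm,user,c,pcpu | grep -v ps | grep -v sh | grep mpirun | grep -v mon.exe | grep -v grep >> " + myFileName);
system(psCommand.c_str());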
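An alternative to writing ps output to a temporary file is to read it through a pipe, which would leave nothing behind to delete. The sketch below uses POSIX `popen()` in place of `system()` with redirection; this is a swapped-in technique for illustration, not what monitor.cpp does, and `runAndCapture` is a hypothetical helper name.

```cpp
#include <stdio.h>   // popen, pclose, fgets (POSIX)
#include <stdexcept>
#include <string>
#include <vector>

// Runs a shell command with popen() and returns its standard output,
// one vector entry per line, without creating any file.
std::vector<std::string> runAndCapture(const std::string& command) {
    FILE* pipe = popen(command.c_str(), "r");
    if (pipe == nullptr) throw std::runtime_error("popen failed");

    std::vector<std::string> lines;
    std::string current;
    char buffer[256];
    while (fgets(buffer, sizeof buffer, pipe) != nullptr) {
        current += buffer;  // long lines arrive in several pieces
        if (!current.empty() && current.back() == '\n') {
            current.pop_back();
            lines.push_back(current);
            current.clear();
        }
    }
    if (!current.empty()) lines.push_back(current);  // last line without a newline
    pclose(pipe);
    return lines;
}
```

A call such as `runAndCapture("ps -u " + username + " --no-headers -o pid,cputime")` would then return the process lines directly, with the same caveat as the original that `username` is trusted input passed to a shell.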