![Page 1: Resource Management Analysis and Accounting Mike Showerman, Mark Klein Joshi Fullop and Jeremy Enos NCSA Blue Waters mshow@ncsa.illinois.edu](https://reader035.vdocuments.us/reader035/viewer/2022070414/5697c0131a28abf838ccca8f/html5/thumbnails/1.jpg)
Resource Management Analysis and Accounting
Mike Showerman, Mark Klein, Joshi Fullop and Jeremy Enos
NCSA Blue Waters
mshow@ncsa.illinois.edu
![Page 2: Resource Management Analysis and Accounting Mike Showerman, Mark Klein Joshi Fullop and Jeremy Enos NCSA Blue Waters mshow@ncsa.illinois.edu](https://reader035.vdocuments.us/reader035/viewer/2022070414/5697c0131a28abf838ccca8f/html5/thumbnails/2.jpg)
Interfaces to Accounting data
• Allocations and accounting database
  • Command line
  • Portal
  • User interface
• Spreadsheets of operational metrics
  • For reporting to management (going away I hope)
• Integrated System Console
![Page 3: Resource Management Analysis and Accounting Mike Showerman, Mark Klein Joshi Fullop and Jeremy Enos NCSA Blue Waters mshow@ncsa.illinois.edu](https://reader035.vdocuments.us/reader035/viewer/2022070414/5697c0131a28abf838ccca8f/html5/thumbnails/3.jpg)
![Page 4: Resource Management Analysis and Accounting Mike Showerman, Mark Klein Joshi Fullop and Jeremy Enos NCSA Blue Waters mshow@ncsa.illinois.edu](https://reader035.vdocuments.us/reader035/viewer/2022070414/5697c0131a28abf838ccca8f/html5/thumbnails/4.jpg)
Complex policies can cause confusion
• Prioritizing specific workloads leads to inefficiencies – That’s OK
  • We are capability job focused
• Can be challenging to determine source of utilization drop
  • Policy
  • System issue
  • Available workload
![Page 5: Resource Management Analysis and Accounting Mike Showerman, Mark Klein Joshi Fullop and Jeremy Enos NCSA Blue Waters mshow@ncsa.illinois.edu](https://reader035.vdocuments.us/reader035/viewer/2022070414/5697c0131a28abf838ccca8f/html5/thumbnails/5.jpg)
Full feature view
![Page 6: Resource Management Analysis and Accounting Mike Showerman, Mark Klein Joshi Fullop and Jeremy Enos NCSA Blue Waters mshow@ncsa.illinois.edu](https://reader035.vdocuments.us/reader035/viewer/2022070414/5697c0131a28abf838ccca8f/html5/thumbnails/6.jpg)
More typical view
![Page 7: Resource Management Analysis and Accounting Mike Showerman, Mark Klein Joshi Fullop and Jeremy Enos NCSA Blue Waters mshow@ncsa.illinois.edu](https://reader035.vdocuments.us/reader035/viewer/2022070414/5697c0131a28abf838ccca8f/html5/thumbnails/7.jpg)
Feature descriptions
• Draining
• Missing or mismatch
• Unavailable
• Drain Thrashing
Xtreme Accounting
![Page 8: Resource Management Analysis and Accounting Mike Showerman, Mark Klein Joshi Fullop and Jeremy Enos NCSA Blue Waters mshow@ncsa.illinois.edu](https://reader035.vdocuments.us/reader035/viewer/2022070414/5697c0131a28abf838ccca8f/html5/thumbnails/8.jpg)
Moab Torque Alps
Source: CUG 2013 paper by Matt Ezell (ORNL)
[Figure labels: Allocation and accounting database; qsub]
![Page 9: Resource Management Analysis and Accounting Mike Showerman, Mark Klein Joshi Fullop and Jeremy Enos NCSA Blue Waters mshow@ncsa.illinois.edu](https://reader035.vdocuments.us/reader035/viewer/2022070414/5697c0131a28abf838ccca8f/html5/thumbnails/9.jpg)
Challenges in usage data: A man with one source of usage data knows his usage
• MAM interface (previously Gold)
  • The good
    • Realtime logging allocation integration
    • Includes reservations in accounting
  • The bad
    • Undocumented
    • No retry if data fails to send
      • Path goes across HSN. You would be shocked how often there is a failure sending data to an outside server.
    • Incomplete data (gres and more)
• Components communicate but do not coordinate
![Page 10: Resource Management Analysis and Accounting Mike Showerman, Mark Klein Joshi Fullop and Jeremy Enos NCSA Blue Waters mshow@ncsa.illinois.edu](https://reader035.vdocuments.us/reader035/viewer/2022070414/5697c0131a28abf838ccca8f/html5/thumbnails/10.jpg)
Semi-Manual analysis
Alpsevents Query

JobId   Date                 Epoch       Event     ResId  apid
485968  2013-11-13 23:33:37  1384407217  bound     1700   0
485968  2013-11-13 23:33:39  1384407219  placed    1700   2418935
485968  2013-11-13 23:33:40  1384407220  released  1700   2418935
485968  2013-11-13 23:33:40  1384407220  placed    1700   2418936
485968  2013-11-13 23:35:03  1384407303  released  1700   2418936
485968  2013-11-13 23:35:03  1384407303  canceled  1700   2418934
485968  2013-11-13 23:35:04  1384407304  removed   1700   2418934
Accounting Database

job_id        485968
login         fiedler
account       jme
machine       nid11293
group_name    vendor_cray
start_time    11/13/2013 11:33:03 PM
end_time      11/13/2013 11:55:30 PM
queue         normal
wallduration  1099
charge        6.11
nodes         20
processors    640
qos           sub_25p

shredded_job_pbs

shredded_job_pbs_id      185482
job_id                   485968
job_array_index          -1
host                     BW
queue                    normal
user                     fiedler
groupname                vendor_cray
ctime                    1384406706
qtime                    1384406706
start                    1384407217
end                      1384407304
etime                    1384406706
exit_status              0
session                  22725
jobname                  test_links
owner                    fiedler@h2ologin1
account                  jme
exec_host                NodesRemoved
resources_used_vmem      157659136
resources_used_mem       11030528
resources_used_walltime  87
resources_used_nodes     20
resources_used_cpus      640
resources_used_cput      1
resource_list_nodes      20:ppn=32:xe
resource_list_neednodes  20:ppn=32:xe
resource_list_walltime   300
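The three views of job 485968 on this slide can be compared mechanically. The following is a minimal sketch, not the tooling used on Blue Waters: the values are copied from the slide, while the dictionary layout, the function names and the 5-second tolerance are assumptions made for illustration.

```python
from datetime import datetime

# (epoch, event, apid) rows copied from the Alpsevents query for job 485968.
alps_events = [
    (1384407217, "bound", 0),
    (1384407219, "placed", 2418935),
    (1384407220, "released", 2418935),
    (1384407220, "placed", 2418936),
    (1384407303, "released", 2418936),
    (1384407303, "canceled", 2418934),
    (1384407304, "removed", 2418934),
]
# Fields copied from the accounting database row.
accounting = {
    "start_time": "11/13/2013 11:33:03 PM",
    "end_time": "11/13/2013 11:55:30 PM",
    "wallduration": 1099,
}
# Fields copied from the shredded_job_pbs row.
torque = {"start": 1384407217, "end": 1384407304, "resources_used_walltime": 87}


def alps_reservation_seconds(events):
    """Seconds from the first 'bound' event to the last 'removed' event."""
    bound = min(t for t, name, _ in events if name == "bound")
    removed = max(t for t, name, _ in events if name == "removed")
    return removed - bound


def accounting_span_seconds(rec):
    """end_time minus start_time for the accounting row, in seconds."""
    fmt = "%m/%d/%Y %I:%M:%S %p"
    start = datetime.strptime(rec["start_time"], fmt)
    end = datetime.strptime(rec["end_time"], fmt)
    return int((end - start).total_seconds())


durations = {
    "alps_bound_to_removed": alps_reservation_seconds(alps_events),
    "torque_end_minus_start": torque["end"] - torque["start"],
    "torque_used_walltime": torque["resources_used_walltime"],
    "accounting_wallduration": accounting["wallduration"],
    "accounting_end_minus_start": accounting_span_seconds(accounting),
}


def disagreeing(d, tolerance=5):
    """Names of durations that differ from Torque's walltime by more than tolerance seconds."""
    baseline = d["torque_used_walltime"]
    return sorted(k for k, v in d.items() if abs(v - baseline) > tolerance)


for name, seconds in durations.items():
    print(f"{name:28s} {seconds:6d} s")
print("disagree with Torque:", disagreeing(durations))
```

Using Torque's walltime as the baseline is an arbitrary choice for the sketch. The point is only that the sources do not agree on how long the job ran.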
![Page 11: Resource Management Analysis and Accounting Mike Showerman, Mark Klein Joshi Fullop and Jeremy Enos NCSA Blue Waters mshow@ncsa.illinois.edu](https://reader035.vdocuments.us/reader035/viewer/2022070414/5697c0131a28abf838ccca8f/html5/thumbnails/11.jpg)
Integrated System Console
• Does a wide array of tasks
  • Just focusing on relevant parts
• Event and log processing and storage engine
  • Trigger alerts based on event templates
  • Parse and store logged data
    • Moab/torque/alps/hsn/storage/nodes
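Template-driven parsing of this kind might look roughly like the sketch below. Everything in it is hypothetical: the log line format, the template names and the `process` function are invented for illustration and are not the ISC's actual interface.

```python
import re

# Hypothetical templates: name -> (pattern, whether a match should raise an alert).
TEMPLATES = {
    "alps_cancel": (re.compile(r"apid (?P<apid>\d+) canceled"), True),
    "alps_placed": (re.compile(r"apid (?P<apid>\d+) placed"), False),
}


def process(line, store, alerts):
    """Match a log line against the templates, store the parsed event, alert if flagged."""
    for name, (pattern, alert) in TEMPLATES.items():
        match = pattern.search(line)
        if match:
            event = {"template": name, **match.groupdict()}
            store.append(event)       # every matched event is stored
            if alert:
                alerts.append(event)  # only flagged templates trigger an alert
            return event
    return None                       # unmatched lines are ignored in this sketch


store, alerts = [], []
process("resid 1700 apid 2418935 placed", store, alerts)
process("resid 1700 apid 2418934 canceled", store, alerts)
process("unrelated log noise", store, alerts)
print(len(store), "stored,", len(alerts), "alert(s)")
```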
![Page 12: Resource Management Analysis and Accounting Mike Showerman, Mark Klein Joshi Fullop and Jeremy Enos NCSA Blue Waters mshow@ncsa.illinois.edu](https://reader035.vdocuments.us/reader035/viewer/2022070414/5697c0131a28abf838ccca8f/html5/thumbnails/12.jpg)
ISC will do more
• Integrated data give us the power to make better decisions
  • When alps sees more than 1 cancel… is the filesystem down? If so, take accounting action; if not, alert
  • Moab time/alps time/torque time out of sync… Adjust charging?
  • Filesystem issue: should walltime limits be increased?
  • Hole in torus… prevent some jobs from starting?
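The first rule above can be written down as a small decision function. This is a sketch under assumptions: the event tuple layout, the `filesystem_down` flag and the returned action names are hypothetical, and what "accounting action" actually does is left open, as it is on the slide.

```python
def count_cancels(events):
    """Number of 'canceled' events in a list of (epoch, event, apid) tuples."""
    return sum(1 for _, name, _ in events if name == "canceled")


def decide(cancel_count, filesystem_down):
    """More than one cancel: accounting action if the filesystem is down, otherwise alert."""
    if cancel_count <= 1:
        return "none"
    return "accounting_action" if filesystem_down else "alert"


# Hypothetical event stream with two cancels for the same reservation.
events = [
    (100, "placed", 1),
    (160, "canceled", 1),
    (170, "canceled", 1),
]
print(decide(count_cancels(events), filesystem_down=True))
print(decide(count_cancels(events), filesystem_down=False))
```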
![Page 13: Resource Management Analysis and Accounting Mike Showerman, Mark Klein Joshi Fullop and Jeremy Enos NCSA Blue Waters mshow@ncsa.illinois.edu](https://reader035.vdocuments.us/reader035/viewer/2022070414/5697c0131a28abf838ccca8f/html5/thumbnails/13.jpg)
Additional data to collect
• Where is the time really going?
  • Sources: moab/torque/alps issue, hsn, filesystem
• Do you account for it in detail? Suspect time?
[Figure labels: Job Time, Overhead, Moab Time, Torque Time, Alps Time, User Time]
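As a rough illustration of the split between job time and user time, the sketch below reuses the Torque start/end and the ALPS events recorded for job 485968 earlier in the deck. Treating each apid's placed-to-released interval as user time, and everything else inside the Torque window as overhead, is an assumption made here, not a definition from the slides.

```python
# (epoch, event, apid) rows for job 485968, copied from the Alpsevents query.
alps_events = [
    (1384407217, "bound", 0),
    (1384407219, "placed", 2418935),
    (1384407220, "released", 2418935),
    (1384407220, "placed", 2418936),
    (1384407303, "released", 2418936),
    (1384407303, "canceled", 2418934),
    (1384407304, "removed", 2418934),
]
torque_start, torque_end = 1384407217, 1384407304  # from shredded_job_pbs


def user_seconds(events):
    """Sum of placed-to-released time over all apids (assumed to be user time)."""
    placed_at = {}
    total = 0
    for epoch, name, apid in events:
        if name == "placed":
            placed_at[apid] = epoch
        elif name == "released" and apid in placed_at:
            total += epoch - placed_at.pop(apid)
    return total


job_time = torque_end - torque_start
user_time = user_seconds(alps_events)
overhead = job_time - user_time
print(f"job {job_time} s, user {user_time} s, overhead {overhead} s")
```

For this short test job the overhead is a few seconds out of 87. The same arithmetic over many jobs would show where the larger losses sit.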
![Page 14: Resource Management Analysis and Accounting Mike Showerman, Mark Klein Joshi Fullop and Jeremy Enos NCSA Blue Waters mshow@ncsa.illinois.edu](https://reader035.vdocuments.us/reader035/viewer/2022070414/5697c0131a28abf838ccca8f/html5/thumbnails/14.jpg)
Node state Accounting
• Job failures can cause nodes to become suspect
  • Often very large subsets of the system
  • Overhead has not been quantified
• Extension of the SDB database to trigger on state changes
  • Store node state change data in ISC
  • Account for reduced availability
  • Begin collecting MTTI data
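Once node state changes are stored, reduced availability can be totalled from them. The sketch below is illustrative only: the record layout `(epoch, node, new_state)`, the state names and the sample data are hypothetical, and nodes are assumed to be up at the start of the window.

```python
def unavailable_node_seconds(changes, window_start, window_end, available=("up",)):
    """Total node-seconds spent outside the available states within the window."""
    current = {}  # node -> (state, epoch at which that state began)
    total = 0
    for epoch, node, new_state in sorted(changes):
        state, since = current.get(node, ("up", window_start))
        if state not in available:
            total += epoch - since    # close out the unavailable interval
        current[node] = (new_state, epoch)
    for state, since in current.values():
        if state not in available:
            total += window_end - since  # still unavailable at the end of the window
    return total


# Hypothetical state changes inside a 1000-second window.
changes = [
    (100, "nid00001", "suspect"),
    (400, "nid00001", "up"),
    (200, "nid00002", "down"),
]
lost = unavailable_node_seconds(changes, window_start=0, window_end=1000)
print(lost, "node-seconds unavailable")
```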