
Process Management & Monitoring WG

Quarterly Report

January 25, 2005


Components

- Process Management
  - Process Manager
  - Checkpoint Manager
- Monitoring
  - Job Monitor
  - System/Node Monitors
  - Meta Monitoring


Component Progress

- Checkpoint Manager (LBNL)
  - BLCR
- Process Manager (ANL)
  - MPDPM
- Monitoring (NCSA)
  - Warehouse


Checkpoint Manager: BLCR Status

- Full save and restore of:
  - CPU registers
  - Memory
  - Signals (handlers & pending signals)
  - PID, PGID, etc.
  - Files (w/ limitations)
  - Communication (via MPI)


Checkpoint Manager: BLCR Status (Files)

- Files
  - Files unmodified between checkpoint and restart
  - Files appended to between checkpoint and restart
- Pipes between processes
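To illustrate why files that are unmodified or only appended to are the easy cases, here is a minimal sketch of one common approach: record the path and offset at checkpoint time, then reopen and seek at restart. This is an assumption made for illustration only; the slides do not say how BLCR implements file restore, and the helper names `checkpoint_file` and `restore_file` are hypothetical.

```python
import os
import tempfile


def checkpoint_file(f):
    """Record what is needed to reopen a file later: its path and offset."""
    return {"path": f.name, "offset": f.tell()}


def restore_file(saved):
    """Reopen the file for reading and seek back to the saved offset.

    This only works if the bytes before the offset are unchanged, which
    holds when the file was left alone or only appended to.
    """
    f = open(saved["path"], "rb")
    f.seek(saved["offset"])
    return f


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "log.txt")
        with open(path, "wb") as w:
            w.write(b"line1\nline2\n")

        f = open(path, "rb")
        f.readline()                  # the process consumed the first line
        saved = checkpoint_file(f)    # "checkpoint"
        f.close()

        with open(path, "ab") as a:   # file is appended to before restart
            a.write(b"line3\n")

        g = restore_file(saved)       # "restart"
        print(g.read())               # the rest of the file, including the append
        g.close()
```

A file rewritten in place between checkpoint and restart would make the saved offset point at different data, which is why mutable files need separate handling.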


Checkpoint Manager: BLCR Status (Comms)

- LAM/MPI 7.x over TCP and GM
  - Handles in-flight data (drains)
  - Linear scaling of time w/ job size
  - Migratable
- OpenMPI
  - Will inherit LAM/MPI’s support
- ChaMPIon/Pro (Verari)


Checkpoint Manager: BLCR Status (Ports)

- Linux only
  - “Stock” 2.4.X
  - RedHat 7.2 – 9
  - SuSE 7.2 – 9.0
  - RHEL3/CentOS 3.1
  - 2.6.x port in progress (FC2 & SuSE 9.2)
- x86 (IA32) only today
  - x86_64 (Opteron) will follow 2.6.x port
  - Alpha, PPC and PPC64 may be trivial
  - No IA64 (Itanium) plans


Checkpoint Manager: BLCR Future Work

- Additional coverage
  - Process groups and Sessions (next priority)
  - Terminal characteristics
  - Interval timers
  - Queued RT signals
- More on files
  - Mutable files
  - Directories


Checkpoint Manager: SSS Integration

- Rudimentary Checkpoint Manager
  - Works with Bamboo, Maui and MPDPM
- Long delayed plans for “next gen”
  - Upgraded interface spec (using LRS)
  - Management of “context files”
- lampd
  - mpirun replacement for running LAM/MPI jobs under MPD


Checkpoint Manager: Non-SSS Integration

- Grid Engine
  - DONE by 3rd party (online howto)
- Verari Command Center
  - In testing
- PBS family
  - Torque: Cluster Resources interested
  - PBSPro: Altair Engineering interested (if funded)
- SLURM
  - Mo Jette of LLNL interested (if funded)
- LoadLeveler
  - IBM may publish our URL in support documents


Process Manager Progress (ANL)

- Continued daily use on Chiba City, along with other components
- Miscellaneous hardening of MPD implementation of PM, particularly with respect to error conditions, prompted by Intel use and Chiba experience
- Conversion to LRS, in preparation for presentation of interface at this meeting
- Preparation for BG/L


Monitoring at NCSA: Warehouse Status

- Network code has been revamped; that code is in CVS in oscar sss
  - Connections are now retried
  - Starting to monitor does not wait for all connections to finish
  - Connection and monitoring thread pools are independent
  - No full reset (if lots of nodes are down, continues blindly)
- Any component can be restarted. Restart no longer depends on start order.
- These features were intended for sss-oscar 1.0 (SC2004) but missed it; they made it into 1.01
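The connection behavior listed above can be sketched roughly as follows. This is a minimal illustration, not the warehouse code: the function names (`connect_with_retry`, `start_monitoring`), the `connect` and `poll` callables, the retry count, and the pool sizes are all assumptions. Each node is polled once here, whereas a real monitor would poll repeatedly.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def connect_with_retry(node, connect, max_attempts=3):
    """Try to connect to one node, retrying on failure; None if it stays down."""
    for _ in range(max_attempts):
        try:
            return connect(node)
        except ConnectionError:
            continue
    return None


def start_monitoring(nodes, connect, poll, max_attempts=3, workers=4):
    """Connect and monitor using two independent thread pools.

    A node is handed to the monitoring pool as soon as its own connection
    succeeds, so monitoring does not wait for the other connections.
    Nodes that never connect are skipped rather than triggering a reset.
    """
    results = {}
    lock = threading.Lock()

    with ThreadPoolExecutor(max_workers=workers) as monitor_pool:

        def monitor(node, conn):
            value = poll(conn)
            with lock:
                results[node] = value

        def connector(node):
            conn = connect_with_retry(node, connect, max_attempts)
            if conn is not None:
                monitor_pool.submit(monitor, node, conn)

        with ThreadPoolExecutor(max_workers=workers) as connection_pool:
            for node in nodes:
                connection_pool.submit(connector, node)
        # Leaving the inner block waits for all connection attempts.
    # Leaving the outer block waits for the monitoring tasks.
    return results
```

Keeping the two pools separate means a slow or unreachable node only ties up a connection worker; nodes that are already connected keep being serviced by the monitoring workers.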


Monitoring at NCSA: Warehouse Testing

- Warehouse run on former Platinum cluster at NCSA
- Node count kept dropping
  - 400 nodes originally
  - 200 nodes in post-cluster configuration
  - 120 available for testing
- Ran on 120 nodes with no problems
- Have head node, but cannot have whole cluster
  - So didn't try sss-oscar


Monitoring at NCSA: Warehouse Testing (2)

- "Infinite" Itanium cluster (Infiniband development machine)
  - Have root access
  - Will run warehouse for sure, for long range testing
  - Might try whole suite (semi-production)
- T2 cluster (Dell Xeon 500+ nodes)
  - May run warehouse across (Mike Showerman says)
- Anecdote:
  - Went to test new warehouse_monitor on xtorc. Installed and started new warehouse_monitors on nodes. Called up warehouse_System_Monitor to make sure it wasn't running. The already running System Monitor had connected to all the new warehouse_monitors and everything was running fine.


Monitoring Work at NCSA

- David Boxer, RA, working on warehouse
  - Craig worked bugs and fiddly things, David did development heavy lifting
  - Revamped network code (modularized)
  - Developed new info storage (more on this in the afternoon)
- New info store and logistics
  - info store: redesigned and updated: DONE
  - protocol re-designed: DONE
  - send protocol: DONE
  - receive protocol: still to do
- IBM offered him real money - he's off to work for them.


Monitoring Work at NCSA

- Wire Protocol:
  - I (Craig) need to have working knowledge of signature/hash functions. When I do, I'll be back to coding on this
  - Perilously close to being able to do useful stuff
- Documentation:
  - Have most of a web site written with philosophy of warehouse, and debugging tools.
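For readers unfamiliar with the signature/hash functions mentioned under Wire Protocol, here is a minimal sketch of what signing a wire message can look like, using Python's standard `hmac` and `hashlib` modules. The slides do not describe the warehouse wire format, so the framing (4-byte length, payload, HMAC-SHA256 tag), the shared key, and the names `pack_message` and `unpack_message` are assumptions made for illustration.

```python
import hashlib
import hmac
import struct

DIGEST_SIZE = hashlib.sha256().digest_size  # 32 bytes for SHA-256


def pack_message(key: bytes, payload: bytes) -> bytes:
    """Frame a payload as: 4-byte big-endian length, payload, HMAC tag."""
    tag = hmac.new(key, payload, hashlib.sha256).digest()
    return struct.pack("!I", len(payload)) + payload + tag


def unpack_message(key: bytes, frame: bytes) -> bytes:
    """Check the frame's HMAC tag and return the payload, or raise ValueError."""
    if len(frame) < 4:
        raise ValueError("truncated frame")
    (length,) = struct.unpack("!I", frame[:4])
    payload = frame[4:4 + length]
    tag = frame[4 + length:4 + length + DIGEST_SIZE]
    if len(payload) != length or len(tag) != DIGEST_SIZE:
        raise ValueError("truncated frame")
    expected = hmac.new(key, payload, hashlib.sha256).digest()
    # compare_digest avoids leaking where the first mismatching byte is.
    if not hmac.compare_digest(tag, expected):
        raise ValueError("signature mismatch")
    return payload
```

A receiver that shares the key can detect corruption or tampering: changing any byte of the payload makes the recomputed tag differ from the one in the frame.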


Monitoring at NCSA: Future Work

- New interaction (to come): Node Build and Config Manager
  - On start-up, will talk to Node State Manager and get list of up nodes
  - Subscribe to Node State Manager events for updates
  - For now, can continue to store node state, transition to Scheduler obtaining state information itself.
- Also to come:
  - Intelligent error handling (target-based vs. severity based)
  - Command line debugging/control?
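The planned start-up interaction (fetch the list of up nodes, then subscribe to events for updates) might look roughly like the sketch below. It is an in-process stand-in, not the SSS interface: the classes `NodeStateManager` and `MonitorTargets` and the methods `up_nodes`, `subscribe`, and `set_state` are hypothetical names chosen for illustration.

```python
class NodeStateManager:
    """Hypothetical stand-in that tracks node states and publishes changes."""

    def __init__(self, states):
        self._states = dict(states)
        self._subscribers = []

    def up_nodes(self):
        return sorted(n for n, s in self._states.items() if s == "up")

    def subscribe(self, callback):
        self._subscribers.append(callback)

    def set_state(self, node, state):
        self._states[node] = state
        for callback in self._subscribers:
            callback(node, state)


class MonitorTargets:
    """Keeps the set of nodes the monitor should be watching."""

    def __init__(self, nsm):
        self.nodes = set()
        # Design choice in this sketch: subscribe before taking the
        # snapshot so a change arriving in between is not missed.
        nsm.subscribe(self.on_event)
        self.nodes.update(nsm.up_nodes())

    def on_event(self, node, state):
        if state == "up":
            self.nodes.add(node)
        else:
            self.nodes.discard(node)


if __name__ == "__main__":
    nsm = NodeStateManager({"n1": "up", "n2": "down", "n3": "up"})
    targets = MonitorTargets(nsm)
    nsm.set_state("n2", "up")
    nsm.set_state("n1", "down")
    print(sorted(targets.nodes))
```

With this arrangement the monitor no longer has to store node state itself; it only mirrors what the Node State Manager reports.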