Component updates, SSS-OSCAR Releases, API Discussions, External Users, and SciDAC Phase 2
Al Geist
January 25-26, 2005
Washington DC


Page 1: Component updates, SSS-OSCAR Releases, API Discussions, External Users, and SciDAC Phase 2

Al Geist
January 25-26, 2005
Washington DC

Page 2: Scalable Systems Software

Participating Organizations: ORNL, ANL, LBNL, PNNL, NCSA, PSC, SDSC, SNL, LANL, Ames, IBM, Cray, Intel, SGI

Problem
• Computer centers use incompatible, ad hoc sets of systems tools
• Present tools are not designed to scale to multi-Teraflop systems

Goals
• Collectively (with industry) define standard interfaces between systems components for interoperability
• Create scalable, standardized management tools for efficiently running our large computing centers

Diagram labels: Resource Management, Accounting & user mgmt, System Build & Configure, Job management, System Monitoring

To learn more visit www.scidac.org/ScalableSystems

Page 3: Participating Organizations

Coordinator: Al Geist

Participating Organizations: ORNL, ANL, LBNL, PNNL, PSC, IBM, SGI, SNL, LANL, Ames, NCSA, Cray, Intel

• Running suite at ANL and Ames
• Running components at PNNL
• Maui w/ SSS API (3000/mo), Moab (Amazon, Ford, TeraGrid, …)

How do we position ourselves with respect to the
- National Leadership-class facility? NLCF is a partnership between ORNL (Cray), ANL (BG), PNNL (cluster)
- NERSC and NSF centers

Page 4: Goals for This Meeting

• Updates on the Integrated Software Suite components
• Preparing for next SSS-OSCAR software suite release (quarterly releases this year)
• Planning for SciDAC phase 2 – whitepaper and meeting with MICS director
• Results of scalability tests (Warehouse)
• Discussion of Less Restrictive Syntax

Page 5: Scalable Systems Software Suite

Any updates to this diagram?

[Diagram: components of the Scalable Systems Software Suite]
Diagram labels: Grid Interfaces; Meta Services (Meta Scheduler, Meta Monitor, Meta Manager); Accounting; Event Manager; Service Directory; Scheduler; Node State Manager; Allocation Management; Process Manager; Usage Reports; System & Job Monitor; Job Queue Manager; Node Configuration & Build Manager; Checkpoint/Restart; Validation & Testing; Hardware Infrastructure Manager; Standard XML interfaces; authentication; communication; SSS-OSCAR

Components written in any mixture of C, C++, Java, Perl, and Python can be integrated into the Scalable Systems Software Suite

Page 6: Highlights of Last Meeting (Aug. 26-27 at ANL)

Details in main project notebook

• FastOS presentations - SNL, ORNL, ANL, LBL, and LANL. Discussed what they proposed and how SSS can help.
• SC04 Suite Release – version 1.0. What do we show/demo at SC?
• API Discussions - Less Restrictive Syntax introduced; SSSRMAP version 3 described.
• Discussion of accomplishments, the need to get our software on large clusters, and priorities for wrapping up the project.

Page 7: Since Last Meeting

• "Hackerfest" meeting to prepare release 1.0
  • October 6-8 at ORNL
• SC04 SSS posters and demos
  • Ames, ANL, LBL, PNNL
• Telecoms
  • Every Tuesday - Resource Management group
  • Every other Thursday – Process Management group
• New entries in Electronic Notebooks
  • Five notebooks provide a dynamic SSS web site
  • Over 360 pages of ideas, APIs, and meeting notes

Page 8: Major Topics for This Meeting

• Latest news on the Software Suite components
• Preparing for third SSS-OSCAR software suite release
• Planning for SciDAC phase 2 and meeting with MICS director February 17
  • whitepaper
  • all day meeting with 1 ½ hour for SSS
• Presentation of System and Job Monitor API
• Presentation of Less Restrictive Syntax
• Vote on Process Manager API
• Also discuss getting our components out on large clusters
  • Are components robust enough to use at NLCF or NERSC?
  • Ssslib version with RMAP
  • Fred asks if we can incorporate NWPerf and OTP security

Page 9: Agenda – January 25

8:00 Continental Breakfast
8:30 Al Geist - Project Status
9:00 Fred Johnson - MICS report, next steps in SciDAC
9:30 Scott Jackson - Resource Management components
10:30 Break
11:00 Will McLendon - Validation and Testing
12:00 Lunch (on own in hotel)
1:30 Paul Hargrove - Process Management and Monitoring
2:30 Narayan Desai - Node Build, Configure
3:30 Break
4:00 Craig Stefan - "Warehouse" system monitoring package
5:00 Rusty Lusk – Present Less Restrictive Syntax
5:30 Adjourn for dinner

Page 10: Agenda – January 26

8:00 Continental Breakfast
8:30 Thomas Naughton - Preparing for next SSS-OSCAR software release
9:00 Rusty - Discussion and vote on Process Manager API
9:30 Incorporating NWPerf into SSS suite
10:00 Group discussion of whitepaper, closure of this project, and ideas for what to propose next
10:30 Break
11:00 Discussion - finalizing the whitepaper, meeting with Mike Strayer on February 17, SciDAC PI Meeting in June (invited to give poster), set next meeting date: May 10-11 in San Francisco or ANL
12:00 Meeting Ends

Page 11: CS ISIC Presentations February 17

Presentations will be made to head of SciDAC by all four CS ISICs
One hour presentation followed by 30 minutes for discussion
Each will also prepare a whitepaper

Outline
• What are the System Software challenges for SciDAC
• Goals of project
  • standardized and flexible API
  • modular architecture (portable across HW)
  • scalable reference implementation
• Highlights (include impact)
  • XML based API independent of language and protocol (ssslib)
  • architecture that allows plug and play (SD, EM, meatball)
  • suite releases (SSS-OSCAR, integrated components)
  • production users (ANL, Ames, PNNL, NCSA?, others?)
  • adoption of API by existing products (Maui, Moab)
• Future CS ISIC ideas
  • National Leadership Computing facility (Cray, IBM BG, SGI, clusters)
  • FastOS (discussions from last meeting)
  • Cray software roadmap (Al and Rusty attending from SSS)

Page allotments noted beside the outline: 1 page, 2 pages, 1 page

Page 12: Unified Computing Environment (diagram)

[Diagram: Unified Computing Environment. Common look & feel across diverse HW]
Diagram labels: Capability Platform; Breakthrough Science; Ultrascale Hardware (Rainier, Blue Gene, Red Storm); HW teams; Computational Science Teams; Software & Libs; SW teams; Tuned code; Research team; High-End science problem; SciDAC Science teams

Page 13: Meeting notes

Al Geist – presents project overview and goals for this meeting

Fred – what's going on in MICS
President's FY06 budget out in a couple of weeks. $80B going to the war; budgets going to be tight for the rest of the decade
SciDAC and CS base program are not going to be impacted (they avoid the otherwise 10% cut across the board)
Viz and data management taken over by Fred in MICS office
Recruiting second Math PM in MICS office to help Gary
Ed Oliver is being replaced this summer. New Secretary of Energy
SciDAC PI mtg: late June in San Francisco (URL). Much broader, emphasis on science
Math ISICs held a get-together in December to discuss SciDAC++ and generated a common whitepaper for Strayer, who has a Math interest
For the CS ISICs he has a broader view: what is going on in the ISICs
Up to 4 people per ISIC will meet in a building 3 blocks from HQ
Issues: first round of SciDAC had Math ISICs and CS ISICs. Should SciDAC++ have more blended ISICs? We have 5 yrs experience
Whitepaper: 2-5 pages written to Michael, who is an advocate for SciDAC. Vision, progress, success/impact, gaps/opportunities

Page 14: Meeting notes

What are the vision changes in SciDAC++?
Petascale systems, blended ISICs, gaps that you (Fred) see
Haven't sat down and asked "What are the scientific goals?" yet
SciDAC++ call in early October
CS gaps
- SDM: understanding gigantic data sets (viz and analysis)
- Should viz be a part of SciDAC++? Common viz infrastructure needed?
- PERC: ongoing emphasis on understanding SciDAC app performance
- SSS: understand what happens with and without system software
- How to handle SGI (NUMA), if it matters
- First 5 yrs: lot of infrastructure build-up. What is the next step?
- Getting vendor buy-in, plus Linux cluster strategy
- CCA: transition of apps, what is the rate of adoption?
SciDAC is a layered approach. Math and CS working with apps teams
ISICs: initial impact from math side of house, CS would take longer
We have had time, so now how do we help?
In future we will have large heterogeneous systems across DOE
Parallel file system research: do we need to add it to SciDAC++?

Page 15: Meeting notes

Compute platform in SciDAC-1 was singular: NERSC IBM.
In SciDAC++ this will be much more challenging, heterogeneous
How much integration can SSS help with? Incl. file system
Portability and common software base/environment on DOE systems
To Strayer
- Site wants to use PBS Pro: how to do this with SSS
- Lines of code in SSS, RAS strategy, HPSS support, data migration
- Number of daemons (scalability) and any kernel mods?
Security and NWPerf are two things to consider for SSS

Page 16: Meeting notes

Scott Jackson – Resource Management components
Python-based SSSRMAP SDK almost finished – need to integrate w/ other SDK
Craig: initial efforts on SSSRMAP integration into ssslib
General evidence of adoption and value of SSS components
Bamboo – supports chkpt/restart, PBS/LoadLeveler syntax
Gold – production use on multiple PNL systems incl. 11.8 TF cluster; dozens of downloads; began discussions with DOD HPCMP
Maui – support for chkpt, enhanced prioritization, throttling, and QOS; installed on 2,500 clusters, 100,000 downloads; running more supercomputers than any other in the world: 17 of top 20 and 75 of top 100 in the Top500 list.

Page 17: Meeting notes

Scott Jackson – Resource Management components (cont)
Full support for SSSRMAP v3
Gold improvements over QBank, improved robustness
Added support for SQLite embedded database (as fast as Postgres)
New reservation design
Maui has added buffer overflow prevention for security
Silver – metascheduler, also called Grid scheduler, using SSS job object and message communication protocols. Available for release in 3 months
Handles cross-site data staging, grid fairness, multi-cluster job allocation
MCOM – common library between Maui and Silver
Future work
- Portability testing – Linux (Red Hat fading out), Fedora, SuSE, AIX, Tru64, OS X
- Fault tolerance supporting 25% cluster loss
- Multisite authentication and authorization
- Future work focused on Silver development

Page 18: Meeting notes

Will McLendon – Validation and Testing
Mostly doing bug fixing on current release v0.2.5
Test driver tool for testing software – ordered tests and API tests
APItest is now in SSS-OSCAR v1.0
Finishing up User Guide: gives overview of APItest and screenshots
Future work
- Being able to diff files
- Configuration file
- Add more SSS component tests
He gives a short demo

Ron Oldfield – setting up SSS integrated test suites
Hired contractor full time this month to do durability and performance tests
He presents talk on "Lightweight File System Project" w/ Lee Ward, himself
Risk mitigation for Red Storm FS (Lustre)
Initial focus on secure storage architecture (not a FS) – nice work here.
Describes this project
Lee Ward knows the answer but can't express it
Barney Maccabe knows the answer but it is wrong.

Page 19: Meeting notes

Paul Hargrove – Process Management components
Checkpoint Manager – BLCR status
Handles
- Files: if unmodified between checkpoint and restart, or only appended between, or pipes between processes
- MPI: if LAM/MPI 7.x over TCP and GM, including migratable task; if OpenMPI, since it will inherit LAM/MPI support; if ChaMPIon/Pro (Verari)
Platforms
- IA32 only (future x86_64, but no plans for IA64)
Linux
- Stock 2.4.x, SuSE 7.2-9.0, RedHat 7.2-9, RHEL3/CentOS 3.1
- 2.6.x port in process (FC2 & SuSE 9.2)
Future work
- Cover process groups and sessions
- Handle directories and mutable files
SSS integration
- Checkpoint manager works w/ Bamboo, Maui, MPDPM
- Upgrade API to LRS
Process manager being hardened and converted to LRS
Preparation for BG/L (one full rack)
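The file rule in the BLCR status (a file is handled if it is unmodified, or only appended to, between checkpoint and restart) can be illustrated with a small Python sketch. This is not how BLCR is implemented. The function names and the size-plus-hash fingerprint are assumptions made purely for illustration.

```python
import hashlib
import os
import tempfile


def fingerprint(path):
    """Record size and SHA-256 of a file at checkpoint time (hypothetical helper)."""
    with open(path, "rb") as f:
        data = f.read()
    return {"size": len(data), "sha256": hashlib.sha256(data).hexdigest()}


def classify_change(path, saved):
    """Return 'unmodified', 'appended', or 'modified' relative to the saved fingerprint."""
    with open(path, "rb") as f:
        data = f.read()
    if len(data) < saved["size"]:
        return "modified"  # file shrank, so bytes present at checkpoint were lost
    prefix_hash = hashlib.sha256(data[: saved["size"]]).hexdigest()
    if prefix_hash != saved["sha256"]:
        return "modified"  # bytes that existed at checkpoint time changed
    return "unmodified" if len(data) == saved["size"] else "appended"


def demo():
    """Exercise the three cases on a temporary file."""
    results = []
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "out.log")
        with open(path, "wb") as f:
            f.write(b"step 1\n")
        saved = fingerprint(path)  # "checkpoint"
        results.append(classify_change(path, saved))
        with open(path, "ab") as f:
            f.write(b"step 2\n")  # append only
        results.append(classify_change(path, saved))
        with open(path, "wb") as f:
            f.write(b"STEP 1\nstep 2\n")  # rewrite earlier content
        results.append(classify_change(path, saved))
    return results
```

Under these assumptions, only the first two outcomes ("unmodified" and "appended") correspond to files the notes describe as handled; "modified" corresponds to the mutable files listed under future work.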

Page 20: Meeting notes

Craig – Warehouse components
Starting to monitor does not wait for all connections to finish
Connection and monitoring thread pools are independent
Future: need to add "full reset"
Any component can be restarted; no longer depends on start order.
Testing
- Ran on Platinum cluster at NCSA on 120 nodes
- Infinite Itanium cluster (128 processors)
- T2 cluster (Dell Xeon, 500+ nodes)
- xtorc warehouse test did autodiscovery of new clients
David Boxer, RA, doing main programming on Warehouse. IBM hired him.
Future work
- API to Node State Manager
- Intelligent error handling
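The Warehouse notes say that monitoring starts without waiting for every connection, and that the connection and monitoring thread pools are independent. A minimal Python sketch of that pattern follows. It uses the standard `concurrent.futures` module, and `connect`, `monitor`, and `run` are hypothetical stand-ins, not Warehouse code.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def connect(node):
    """Stand-in for opening a connection to a node (hypothetical)."""
    return f"conn:{node}"


def monitor(conn):
    """Stand-in for collecting one sample over an open connection (hypothetical)."""
    return f"sample from {conn}"


def run(nodes):
    """Connect and monitor using two independent thread pools."""
    samples = []
    with ThreadPoolExecutor(max_workers=4) as connect_pool, \
         ThreadPoolExecutor(max_workers=4) as monitor_pool:
        pending = [connect_pool.submit(connect, n) for n in nodes]
        monitor_jobs = []
        # as_completed yields each connection as soon as it is ready,
        # so monitoring starts without waiting for the slowest node.
        for fut in as_completed(pending):
            monitor_jobs.append(monitor_pool.submit(monitor, fut.result()))
        for job in monitor_jobs:
            samples.append(job.result())
    return sorted(samples)
```

Keeping the pools separate means a slow or hung connection attempt occupies only a connection worker and does not block monitoring of nodes that are already connected.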

Page 21: Meeting notes

Narayan Desai – Build and configure components
Conversion to LRS is in process – not trivial
BG/L arrived – will run SSS software but there are some issues
- Single process/node, no direct TCP support, RAS interface unusual, allocation granularity must be 2^n
- Compute node OS is reloaded for each job (like Scyld model)
Chiba RM (scheduler, QM, allocation manager) used as-is
New implementation of PM – system partitioning, BG/L kernel loading, PMI implementation
New configuration management components, different diagnostic model
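The BG/L constraint that allocation granularity must be 2^n can be made concrete with a short Python sketch. The notes do not say whether a request for a non-power-of-two node count is rounded up or rejected. Rounding up is an assumption here, and both function names are hypothetical.

```python
def is_power_of_two(n):
    """True if n is 1, 2, 4, 8, ... (exactly one bit set)."""
    return n >= 1 and (n & (n - 1)) == 0


def round_up_to_power_of_two(requested_nodes):
    """Smallest power of two that is >= requested_nodes (assumed policy)."""
    if requested_nodes < 1:
        raise ValueError("requested_nodes must be at least 1")
    # (n - 1).bit_length() is the exponent of the next power of two at or above n.
    return 1 << (requested_nodes - 1).bit_length()
```

For example, under this assumed policy a job asking for 600 nodes would occupy a 1024-node allocation, which is the kind of cost an allocation manager would need to account for.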

Craig Stefan – New Warehouse information storehouse
Node info list and Resource Consumers
Goes over a couple of examples
New protocol design and sender: DONE
Protocol parser: NOT DONE

Page 22: Meeting notes

Rusty Lusk – Less Restrictive Syntax
Best of both worlds. See notes from last meeting on discussion of two families of syntax.
Command language in XML that: identifies a set of objects, specifies a function, constructs a response
Desirable features – completeness, validation, readability, conciseness
Goes over example of differences between RS and LRS
Goes over the BNF (Backus-Naur form)
Then goes over several examples of LRS. Much more readable.
LRS needs a better name. Suggestion by Paul to call it "S5"
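The three parts of an LRS command named in the notes (identify a set of objects, specify a function, construct a response) can be sketched in Python with the standard `xml.etree.ElementTree` module. The element and attribute names below (`command`, `select`, `apply`, `respond`) and the `NODES` data are invented for illustration. They are not the actual LRS syntax, which these notes do not reproduce.

```python
import xml.etree.ElementTree as ET

# Toy object store standing in for component state (hypothetical data).
NODES = [
    {"name": "n1", "state": "up"},
    {"name": "n2", "state": "down"},
    {"name": "n3", "state": "up"},
]


def build_command(state, function, fields):
    """Build a three-part XML command: object selection, function, response shape."""
    cmd = ET.Element("command")
    ET.SubElement(cmd, "select", {"state": state})
    ET.SubElement(cmd, "apply", {"function": function})
    ET.SubElement(cmd, "respond", {"fields": ",".join(fields)})
    return ET.tostring(cmd, encoding="unicode")


def execute(xml_text):
    """Parse the command, select matching objects, and construct the response."""
    cmd = ET.fromstring(xml_text)
    state = cmd.find("select").get("state")
    function = cmd.find("apply").get("function")
    fields = cmd.find("respond").get("fields").split(",")
    if function != "get":
        raise ValueError(f"unsupported function: {function}")
    selected = [n for n in NODES if n["state"] == state]
    return [{f: n[f] for f in fields} for n in selected]
```

A call such as `execute(build_command("up", "get", ["name"]))` selects the nodes that are up and returns only their names, showing how selection, function, and response shape stay separate in one message.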

Thomas Naughton – SSS-OSCAR
V1.0 release Nov 04 with all SSS components represented.
Preparing v1.1 release for Feb 15 – still OSCAR 3.0 based
Shift to OSCAR 4.1 in v1.2 release in 2Q 2005
Future: extend SSS component tests, improve documentation, ordering
Longer term – support more Linux types, make it an OSCAR "package set"
Release dates this year: v1.2 (Fedora Core 2) May 15, Aug 15, v2.0 at SC05

Page 23: Meeting notes

Rusty Lusk – Process Manager API and vote
Goes over the spec as written in the Process Manager Notebook
Functionality: start/stop process groups, query state of job, deliver signals.
Uses Less Restrictive Syntax for its five commands
- CreateProcessGroup
- GetProcessGroup
- SignalProcessGroup
- KillProcessGroup
- WaitProcessGroup
Datatypes: ProcessGroup and ProcessGroupSpecification
Events: start of job and end of job
Go through some examples w/ discussion
Vote to accept Process Manager API: Yes-12, No-0, Abstain-0
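To make the shape of the Process Manager API easier to picture, here is a minimal in-memory Python mock of the five commands, the two datatypes, and the two events. The real API is expressed in XML using LRS and is specified in the Process Manager Notebook. The field names, method signatures, and semantics below are assumptions inferred only from the command names.

```python
from dataclasses import dataclass, field


@dataclass
class ProcessGroupSpecification:
    """What to run (fields are hypothetical)."""
    command: str
    nprocs: int


@dataclass
class ProcessGroup:
    """A running or finished group of processes (fields are hypothetical)."""
    pgid: int
    spec: ProcessGroupSpecification
    state: str = "running"
    signals: list = field(default_factory=list)


class ProcessManager:
    """In-memory mock; no real processes are started."""

    def __init__(self):
        self._groups = {}
        self._next_id = 1
        self.events = []  # records ("start", pgid) and ("end", pgid)

    def create_process_group(self, spec):
        group = ProcessGroup(self._next_id, spec)
        self._groups[group.pgid] = group
        self._next_id += 1
        self.events.append(("start", group.pgid))
        return group.pgid

    def get_process_group(self, pgid):
        return self._groups[pgid]

    def signal_process_group(self, pgid, signal_name):
        self._groups[pgid].signals.append(signal_name)

    def kill_process_group(self, pgid):
        group = self._groups[pgid]
        if group.state == "running":
            group.state = "finished"
            self.events.append(("end", pgid))

    def wait_process_group(self, pgid):
        # A real manager would block until the group exits; the mock
        # just reports the current state.
        return self._groups[pgid].state
```

A typical sequence in this mock is create, signal, kill, then wait, which produces one "start" event and one "end" event for the group.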

Al Geist – CS gaps for the whitepaper
Unified Software Environment across diverse systems, including development (interactive) environment
Other gaps are I/O, fault tolerance, and security