Quick Overview of NPACI Rocks
Philip M. Papadopoulos
Associate Director, Distributed Computing
San Diego Supercomputer Center
Seed Questions
• Do you buy installation services? From the supplier or a third-party vendor?
  – We integrate. It is easier to have the vendor integrate larger clusters.
• Do you buy pre-configured systems or build your own configuration?
  – Rocks is adaptable to many configurations.
• Do you upgrade the full cluster at one time or in rolling mode?
  – Suggest all at once (very quick with Rocks); can be done as a batch job.
  – Can support rolling upgrades, if desired.
• Do you perform formal acceptance or burn-in tests?
  – Unfortunately, no. We need more automated testing.
Installation/Management
• Need to have a strategy for managing cluster nodes
• Pitfalls
  – Installing each node “by hand”
    • Difficult to keep software on nodes up to date
  – Disk-imaging techniques (e.g., VA Disk Imager)
    • Difficult to handle heterogeneous nodes
    • Treats the OS as a single monolithic system
  – Specialized installation programs (e.g., IBM’s LUI, or RWCP’s multicast installer)
    • Let the Linux packaging vendors do their job
• Penultimate: Red Hat Kickstart
  – Define the packages needed for the OS on nodes; Kickstart gives a reasonable measure of control.
  – Need to fully automate to scale out (Rocks gets you there)
Scaling out
• Evolve to management of “two” systems
  – The front end(s)
    • Login host
    • Users’ home areas, passwords, groups
    • Cluster configuration information
  – The compute nodes
    • Disposable OS image
    • Let software manage node heterogeneity
    • Parallel (re)installation
    • Data partitions on cluster drives are untouched during re-installs
• Cluster-wide configuration files derived through reports from a MySQL database (DHCP, hosts, PBS nodes, …)
NPACI Rocks Toolkit – rocks.npaci.edu
• Techniques and software for easy installation, management, monitoring, and update of clusters
• Installation
  – Bootable CD + floppy that contains all the packages and site configuration info to bring up an entire cluster
• Management and update philosophies
  – Trivial to completely reinstall any (or all) nodes
  – Nodes are 100% automatically configured
    • Use of DHCP and NIS for configuration
  – Use Red Hat’s Kickstart to define the set of software that defines a node
  – All software is delivered in Red Hat Packages (RPMs)
    • Encapsulate configuration for a package (e.g., Myrinet)
    • Manage dependencies
  – Never try to figure out whether node software is consistent
    • If you ever ask yourself this question, reinstall the node
Rocks Current State – Ver. 2.1
• Now tracking Red Hat 7.1
  – 2.4 kernel
  – “Standard tools”: PBS, Maui, MPICH, GM, SSH, SSL, …
  – Could support other distros, but we don’t have staff for this.
• Designed to take “bare hardware” to a cluster in a short period of time
  – Linux upgrades are often “forklift-style”; Rocks supports this as the default mode of administration.
• Bootable CD
  – Kickstart file for the frontend is created from the Rocks webpage.
  – Use the same CD to boot nodes. Automated integration: “legacy Unix config files” are derived from the MySQL database.
• Re-installation (we have a single HTTP server, 100 Mbit)
  – One node: 10 minutes
  – 32 nodes: 13 minutes
  – Use multiple HTTP servers + IP-balancing switches for scale
More Rocksisms
• Leverage widely used (standard) software wherever possible
  – Everything is in Red Hat Packages (RPMs)
  – Red Hat’s “kickstart” installation tool
  – SSH, Telnet (only during installation), existing open-source tools
• Write only the software that we need to write
• Focus on simplicity
  – Commodity components
    • For example: x86 compute servers, Ethernet, Myrinet
  – Minimal
    • For example: no additional diagnostic or proprietary networks
• Rocks is a collection point of software for people building clusters
  – It is evolving to include cluster software and packaging from more than just SDSC and UCB
  – <[your-software.i386.rpm] [your-software.src.rpm] here>
Rocks-dist
• Integrates Red Hat packages from
  – Red Hat (mirror): base distribution + updates
  – Contrib directory
  – Locally produced packages
  – Local contrib (e.g., commercially bought code)
  – Packages from rocks.npaci.edu
• Produces a single updated distribution that resides on the front end
  – It is a Red Hat distribution with patches and updates applied
• A Kickstart (Red Hat) file is a text description of what’s on a node. Rocks automatically produces frontend and node files.
• Different Kickstart files and different distributions can co-exist on a front end to add flexibility in configuring nodes.
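The merge that rocks-dist performs can be sketched roughly as follows. All directory and package names here are made up for illustration, and a real implementation compares RPM versions rather than blindly copying:

```shell
#!/bin/sh
# Sketch of the rocks-dist idea: merge several package sources into one
# distribution tree, with later (more local) sources taking priority.
# Paths and package names are hypothetical stand-ins for real RPM trees.
set -e
work=$(mktemp -d)
mkdir -p "$work/base" "$work/updates" "$work/local" "$work/dist"

# Empty files standing in for real RPMs
touch "$work/base/foo-1.0-1.i386.rpm"
touch "$work/base/bar-1.0-1.i386.rpm"
touch "$work/updates/foo-1.1-1.i386.rpm"   # newer foo from updates
touch "$work/local/baz-0.1-1.i386.rpm"     # locally produced package

# Copy in priority order: base first, then updates, then local.
# A real tool would resolve duplicate packages by version, not filename.
for src in base updates local; do
    cp "$work/$src/"*.rpm "$work/dist/"
done

ls "$work/dist"
```

The result is a single directory that can be served over HTTP as the distribution the compute nodes kickstart from.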
insert-ethers
• Used to populate the “nodes” MySQL table
• Parses a file (e.g., /var/log/messages) for DHCPDISCOVER messages
  – Extracts the MAC address and, if it is not in the table, adds the MAC address and a hostname to the table
• For every new entry:
  – Rebuilds /etc/hosts and /etc/dhcpd.conf
  – Reconfigures NIS
  – Restarts DHCP and PBS
• Hostname is <basename>-<cabinet>-<chassis>
• Configurable to change the hostname
  – E.g., when adding new cabinets
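The core of the discovery step can be sketched with a few lines of shell. The log format below is a typical dhcpd syslog line but is fabricated here; the real insert-ethers also records each MAC in the MySQL “nodes” table:

```shell
#!/bin/sh
# Minimal sketch of insert-ethers-style discovery: scan a syslog file
# for DHCPDISCOVER lines, pull out each previously unseen MAC address,
# and assign it a <basename>-<cabinet>-<chassis> hostname (cabinet 0).
log=$(mktemp)
cat > "$log" <<'EOF'
Oct  1 12:00:01 frontend dhcpd: DHCPDISCOVER from 00:50:8b:aa:bb:01 via eth0
Oct  1 12:00:02 frontend dhcpd: DHCPDISCOVER from 00:50:8b:aa:bb:02 via eth0
Oct  1 12:00:03 frontend dhcpd: DHCPDISCOVER from 00:50:8b:aa:bb:01 via eth0
EOF

# Unique MACs in order of first appearance, each paired with a hostname
awk '/DHCPDISCOVER from/ {
    mac = $8
    if (!(mac in seen)) {
        seen[mac] = 1
        printf "compute-0-%d %s\n", n++, mac
    }
}' "$log"
```

Each emitted line corresponds to one new table entry, after which the config files are regenerated and DHCP/PBS restarted.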
Configuration Derived from Database
Automated node discovery: insert-ethers listens to Node 0 … Node N and populates the MySQL DB. Report scripts then derive the config files from the database:

• MySQL DB → makehosts → /etc/hosts
• MySQL DB → makedhcp → /etc/dhcpd.conf
• MySQL DB → pbs-config-sql → PBS node list
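A report like makehosts is conceptually a small formatter over database rows. In this sketch the database query is mocked with a here-document; a real report would run something like `mysql -e 'SELECT ip, name FROM nodes'` (hypothetical schema):

```shell
#!/bin/sh
# Sketch of the makehosts idea: turn (ip, name) rows reported from the
# cluster database into /etc/hosts entries. The ".local" suffix and all
# addresses below are illustrative assumptions.
gen_hosts() {
    awk 'NF == 2 { printf "%s\t%s.local\t%s\n", $1, $2, $2 }'
}

gen_hosts <<'EOF'
10.1.0.254 frontend-0
10.1.0.1 compute-0-0
10.1.0.2 compute-0-1
EOF
```

Because every such file is derived from the database, adding a node means adding one row and re-running the reports, never hand-editing config files.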
Remote re-installation: shoot-node and eKV
• Rocks provides a simple method to remotely reinstall a node
  – CD/floppy is used to install the first time
• By default, hard power cycling will cause a node to reinstall itself
  – Addressable PDUs can do this on generic hardware
• With no serial (or KVM) console, we are able to watch a node as it installs (eKV), but …
  – Can’t see BIOS messages at boot-up
• Syslog for all nodes is sent to a log host (and to local disk)
  – Can look at what a node was complaining about before it went offline
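With every node’s syslog forwarded to one host, the post-mortem for a dead node is a grep. The log contents below are fabricated for illustration; the node name and message formats are assumptions:

```shell
#!/bin/sh
# Sketch of the log-host idea: one consolidated syslog file lets you
# see the last messages from a node that has since gone offline.
log=$(mktemp)
cat > "$log" <<'EOF'
Oct  2 03:10:11 compute-0-3 kernel: eth0: link up
Oct  2 03:12:40 compute-0-7 kernel: EXT2-fs error (device ide0(3,2))
Oct  2 03:12:41 compute-0-7 kernel: Remounting filesystem read-only
Oct  2 03:13:02 compute-0-3 sshd: accepted connection
EOF

# Last few messages from the node that went offline
grep ' compute-0-7 ' "$log" | tail -n 5
```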
[Screenshot: remotely starting reinstallation on two nodes (192.168.254.253 and 192.168.254.254) via shoot-node, watched with eKV]
Monitoring your cluster
• PBS has a GUI called xpbsmon. Gives a nice graphical view of the up/down state of nodes
• SNMP status
  – Use the extensive SNMP MIB defined by the Linux community to find out many things about a node
    • Installed software
    • Uptime
    • Load
    • Slow
• Ganglia (UCB): IP-multicast-based monitoring system
  – 20+ different health measures
• I think we’re still weak here; learning about other activities in this area (e.g., NGOP, CERN activities, City Toolkit)
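An SNMP status check boils down to querying an agent and parsing the reply. A real query would be something like `snmpget -v1 -c public node-name sysUpTime.0`; here the reply string is canned so the parsing can be shown self-contained:

```shell
#!/bin/sh
# Sketch of an SNMP uptime check. The reply below is a typical
# net-snmp-style line, fabricated here rather than fetched live.
reply='DISMAN-EVENT-MIB::sysUpTimeInstance = Timeticks: (8640000) 1 day, 0:00:00.00'

# Extract the uptime in timeticks (hundredths of a second)
ticks=$(echo "$reply" | sed 's/.*(\([0-9]*\)).*/\1/')
days=$((ticks / 8640000))
echo "uptime: $days day(s)"
```

Polling every node this way is simple but, as noted above, slow at scale, which is one motivation for multicast-based systems like Ganglia.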
CERN
• cern.ch/hep-proj-grid-fabric
• Installation tools: wwwinfo.cern.ch/pdp