© 2008 IBM Corporation
A Database-Centric Approach to System Management
The Blue Gene Supercomputer
Tom Budnik
Mark Megerian
August 2008

Database-Centric System Management
Why use a database for Blue Gene?
Need a software representation of the Blue Gene hardware
A machine of such large scale requires a persistent means of storing errors (RAS events), job history, block definitions, environmental readings, etc.
DB2 is the central repository of ALL system information
Allows control system components to get hardware information and topology from the database, which is always kept current
Blue Gene Navigator pulls the majority of the data it displays from the database
All current jobs, as well as all completed jobs, are stored
Admins can see a history of every job that has ever been run on the machine
We record the start time and end time, as well as the number of nodes used, and Navigator uses this information to compute machine utilization
All service actions and replaced hardware are tracked in the database
DB2 is used as a method of communication between components
Setting values in the database can trigger actions in other components
Can simplify the design by having policy stored in the database itself via procedures, triggers, and constraints instead of in the code
Enforces consistency across components and reduces bugs
DB2 and the Control System run on the “Service Node” machine (pSeries running Linux), which controls the Blue Gene nodes
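The utilization figure derived from job history can be sketched as follows. This is a minimal illustration, not Navigator's actual code: the function name, the tuple layout, and the definition of utilization (node-seconds used divided by node-seconds available in a reporting window) are assumptions.

```python
from datetime import datetime

def machine_utilization(jobs, window_start, window_end, total_nodes):
    """Fraction of available node-time consumed by jobs inside a window.

    jobs is an iterable of (start_time, end_time, nodes_used) tuples,
    mirroring the fields the slide says are recorded for each job.
    """
    window_seconds = (window_end - window_start).total_seconds()
    if window_seconds <= 0 or total_nodes <= 0:
        raise ValueError("window and machine size must be positive")
    used = 0.0
    for start, end, nodes in jobs:
        # Clip each job to the reporting window before counting it.
        overlap = (min(end, window_end) - max(start, window_start)).total_seconds()
        if overlap > 0:
            used += overlap * nodes
    return used / (window_seconds * total_nodes)

# Hypothetical job history for a 1024-node machine over one day.
jobs = [
    (datetime(2008, 8, 1, 0, 0), datetime(2008, 8, 1, 12, 0), 512),
    (datetime(2008, 8, 1, 6, 0), datetime(2008, 8, 2, 6, 0), 256),
]
utilization = machine_utilization(
    jobs, datetime(2008, 8, 1), datetime(2008, 8, 2), total_nodes=1024
)
print(f"{utilization:.4f}")
```

The second job runs past the end of the window, so only its first 18 hours are counted.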

Database-Centric System Management - Benefits
DB2 provides the storage of all data (except logs). This provides a well-known set of interfaces for:
  Querying data using existing tools or SQL
  Building web interfaces and browser-based tools using JSF, PHP, Java, CLI, and many other established technologies
  Standard classes so all code can easily interact with the database
System administrators can learn DB2 from books and classes
New team members can come up to speed quickly
Customers can write their own tools, no hidden or closed data structures
Functions such as backup and recovery, performance settings, and security are handled by DB2
DB2 is a robust, commercial database, able to handle large multi-user apps

Basic SQL Concepts
Schema
  The collection of objects such as tables, views, indexes, and triggers that define the database
  Blue Gene uses BGPSYSDB
Table (most common database object)
  A table is a collection of rows of data, organized into columns
  The table definition (CREATE TABLE) describes the columns and their names and data types (integer, float, character, timestamp, etc.)
  Once a table is created, you can insert, update, and delete rows, and query the contents
  Tables can be joined to other tables, and sorted and nested, to create many useful and complex constructions of data
Example:
CREATE TABLE TBGPBlockUsers
(
blockId char(32) NOT NULL,
username char(32) NOT NULL,
CONSTRAINT BGPBlkUsers_pk PRIMARY KEY (blockId, username),
CONSTRAINT BGPBlkUsers_fk FOREIGN KEY (blockId)
REFERENCES TBGPBlock(blockId) ON DELETE CASCADE
);
CREATE ALIAS BGPBlockUsers for TBGPBlockUsers;
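To see what the composite primary key and the ON DELETE CASCADE foreign key do in practice, here is a small sketch. It uses Python's built-in sqlite3 module as a stand-in for DB2 (an assumption made so the example is runnable anywhere), and the one-column TBGPBlock parent table is a hypothetical simplification.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite (unlike DB2) enforces foreign keys only when this pragma is on.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE TBGPBlock (blockId CHAR(32) NOT NULL PRIMARY KEY);
    CREATE TABLE TBGPBlockUsers (
        blockId  CHAR(32) NOT NULL,
        username CHAR(32) NOT NULL,
        CONSTRAINT BGPBlkUsers_pk PRIMARY KEY (blockId, username),
        CONSTRAINT BGPBlkUsers_fk FOREIGN KEY (blockId)
            REFERENCES TBGPBlock (blockId) ON DELETE CASCADE
    );
""")

conn.execute("INSERT INTO TBGPBlock VALUES ('R00-M0')")
conn.executemany(
    "INSERT INTO TBGPBlockUsers VALUES (?, ?)",
    [("R00-M0", "alice"), ("R00-M0", "bob")],
)

# A user row for a block that does not exist violates the foreign key.
try:
    conn.execute("INSERT INTO TBGPBlockUsers VALUES ('NOSUCHBLOCK', 'carol')")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True

# Deleting the block removes its user rows automatically.
conn.execute("DELETE FROM TBGPBlock WHERE blockId = 'R00-M0'")
remaining_users = conn.execute("SELECT COUNT(*) FROM TBGPBlockUsers").fetchone()[0]
print(orphan_rejected, remaining_users)
```

The cascade means no component has to remember to clean up block users when a block definition is deleted.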

Basic SQL Concepts - continued
Views
  A view is a virtual representation of data: it stores a description of how to retrieve and map the data, but it stores no data itself
  Generally used to present the same data in different ways, and acts like a “virtual” table
Example:
CREATE VIEW BGPMidplane as SELECT serialnumber, productid, machineserialnumber, status, ismaster, posinmachine as location FROM TBGPMidplane;
Index (a stored, sorted set of pointers to rows)
  Like a view, an index contains no actual “data”
  An index is built to sequence the rows using a certain set of columns that is frequently used for searching and sorting
  A full table scan through millions of rows for a particular value would take several minutes, whereas a lookup using an index over that column is often sub-second
  Indexes are kept current as the data changes, so a large number of indexes can impact update performance. There is a tradeoff between query performance and update performance, so only necessary and useful indexes should be created.
Example:
CREATE INDEX EventLogJ on Tbgpeventlog (jobid, recid desc)
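The view and index examples above can be tried out with a short sketch. It again uses Python's sqlite3 as a stand-in for DB2, and the reduced column lists (including the `message` column of the event log) are assumptions for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE TBGPMidplane (
        serialnumber CHAR(19), productid CHAR(16), machineserialnumber CHAR(19),
        status CHAR(1), ismaster CHAR(1), posinmachine CHAR(6) PRIMARY KEY
    );
    CREATE VIEW BGPMidplane AS
        SELECT serialnumber, productid, machineserialnumber, status, ismaster,
               posinmachine AS location
        FROM TBGPMidplane;
    CREATE TABLE Tbgpeventlog (recid INTEGER PRIMARY KEY, jobid INTEGER, message TEXT);
    CREATE INDEX EventLogJ ON Tbgpeventlog (jobid, recid DESC);
""")

# The view stores nothing itself; it exposes posinmachine under the name "location".
conn.execute("INSERT INTO TBGPMidplane VALUES ('SN1', 'PID1', 'M1', 'A', 'T', 'R00-M0')")
row = conn.execute("SELECT location, status FROM BGPMidplane").fetchone()

# Ask the engine how it would find the newest events of one job.
plan = conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT recid FROM Tbgpeventlog WHERE jobid = ? ORDER BY recid DESC",
    (42,),
).fetchall()
plan_text = " ".join(str(step[-1]) for step in plan)
print(row)
print(plan_text)
```

The query plan text names the EventLogJ index, showing that the lookup by jobid avoids a full table scan.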

Basic SQL Concepts - continued
Triggers
  A trigger allows you to define an action to take place, generally when data is updated
  Triggers can be defined on an insert, update, or delete of rows in a table
  Triggers can fire “before” the action, and possibly modify the action taking place
  Triggers can also fire after the action
  Triggers can generate errors to block the action
Example:
create trigger sc_history_i
after insert on tbgpservicecard
referencing new as n
for each row mode db2sql
begin atomic
insert into tbgpservicecard_history
(serialNumber, productId, midplanepos, status, vpd, action)
values
(n.serialNumber, n.productId, n.midplanepos, n.status, n.vpd, 'I');
end @
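The two trigger behaviors described above (recording history after an insert, and raising an error to block an action) can be sketched as follows. The sketch uses Python's sqlite3 as a stand-in for DB2, so the trigger syntax differs slightly from the DB2 example; the `sc_status_chk` trigger and its list of valid status values are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tbgpservicecard (
        serialNumber TEXT, productId TEXT, midplanepos TEXT PRIMARY KEY,
        status TEXT, vpd BLOB
    );
    CREATE TABLE tbgpservicecard_history (
        serialNumber TEXT, productId TEXT, midplanepos TEXT,
        status TEXT, vpd BLOB, action TEXT
    );
    -- AFTER trigger: copy every inserted row into the history table.
    CREATE TRIGGER sc_history_i AFTER INSERT ON tbgpservicecard
    FOR EACH ROW
    BEGIN
        INSERT INTO tbgpservicecard_history
            (serialNumber, productId, midplanepos, status, vpd, action)
        VALUES
            (NEW.serialNumber, NEW.productId, NEW.midplanepos, NEW.status, NEW.vpd, 'I');
    END;
    -- BEFORE trigger (hypothetical): block updates that set an unknown status.
    CREATE TRIGGER sc_status_chk BEFORE UPDATE OF status ON tbgpservicecard
    FOR EACH ROW WHEN NEW.status NOT IN ('A', 'M', 'E')
    BEGIN
        SELECT RAISE(ABORT, 'invalid service card status');
    END;
""")

conn.execute("INSERT INTO tbgpservicecard VALUES ('SN42', 'PID7', 'R00-M0', 'A', NULL)")
history = conn.execute(
    "SELECT serialNumber, action FROM tbgpservicecard_history"
).fetchall()

try:
    conn.execute("UPDATE tbgpservicecard SET status = 'Z' WHERE midplanepos = 'R00-M0'")
    update_blocked = False
except sqlite3.DatabaseError:
    update_blocked = True

print(history, update_blocked)
```

The application code only performs the insert; the history row appears because the policy lives in the database.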

Basic SQL Concepts - continued
Constraint
  A constraint is a “rule” that is enforced by the database
  Check constraints give a list of valid values for a column
  Unique constraints enforce uniqueness on values in a column, or set of columns
  Referential integrity constraints enforce that values in a “child” table exist in the “parent” table
Example:
CREATE TABLE TBGPMidplane
(
serialNumber char(19) ,
productId char(16) NOT NULL,
machineSerialNumber char(19) ,
posInMachine char(6) NOT NULL,
CONSTRAINT BGPMidPo_chk CHECK ( posInMachine LIKE 'R__-M_' ),
status char(1) NOT NULL WITH DEFAULT 'A' ,
CONSTRAINT BGPMidSt_chk CHECK ( status IN ('A','M','E', 'S') ),
isMaster char(1) NOT NULL WITH DEFAULT 'T',
CONSTRAINT BGPMidMs_chk CHECK ( isMaster IN ('T', 'F') ),
vpd VARCHAR(4096) FOR BIT DATA,
seqId BIGINT NOT NULL WITH DEFAULT 0,
CONSTRAINT BGPMidpplane_pk PRIMARY KEY (posInMachine),
CONSTRAINT BGPMidMachineId_fk FOREIGN KEY (machineSerialNumber)
REFERENCES TBGPMachine (serialNumber) ON DELETE RESTRICT,
CONSTRAINT BGPMidplaneType_fk FOREIGN KEY (productId)
REFERENCES TBGPProductType (productId)
)
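A cut-down version of this table shows the constraints rejecting bad rows. The sketch uses Python's sqlite3 as a stand-in for DB2 and keeps only two columns; the helper `try_insert` is hypothetical. Note that SQLite's LIKE is case-insensitive for ASCII by default, which may differ from DB2.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE TBGPMidplane (
        posInMachine CHAR(6) NOT NULL,
        status       CHAR(1) NOT NULL DEFAULT 'A',
        CONSTRAINT BGPMidPo_chk CHECK (posInMachine LIKE 'R__-M_'),
        CONSTRAINT BGPMidSt_chk CHECK (status IN ('A', 'M', 'E', 'S')),
        CONSTRAINT BGPMidplane_pk PRIMARY KEY (posInMachine)
    )
""")

def try_insert(pos, status="A"):
    """Return True if the row is accepted, False if a constraint rejects it."""
    try:
        conn.execute(
            "INSERT INTO TBGPMidplane (posInMachine, status) VALUES (?, ?)",
            (pos, status),
        )
        return True
    except sqlite3.IntegrityError:
        return False

results = {
    "valid": try_insert("R00-M0"),
    "bad_location": try_insert("X00-M0"),      # fails the LIKE 'R__-M_' check
    "bad_status": try_insert("R01-M1", "Q"),   # status not in the allowed list
    "duplicate": try_insert("R00-M0"),         # primary key already present
}
print(results)
```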

DB2 Naming Guidelines for BG/P
Tables always start with TBGP, such as TBGPNodeCard or TBGPLinkCard
  Names are NOT case sensitive in SQL
For each table, there is a view that has the more user-friendly columns
  These are named without the T, such as BGPNodeCard
  In cases where some information is omitted from the view
If there is no need for any derived columns in the view, or omitted columns, then an alias is created
  e.g. BGPClockCard
The net effect is that almost all the time, using the “BGP” name will show what you want
If there is a history being kept, then _history is added to the end

BG/P Tables
TBGPBlock
TBGPBPBlockMap
TBGPSmallBlock
TBGPLinkBlockMap
TBGPProductType
TBGPMachine
TBGPMachineSubnet
TBGPMidplane
TBGPNodeCard
TBGPNode
TBGPServiceCard
TBGPLinkCard
TBGPClockCard
TBGPBulkPowerSupply
TBGPSwitch
TBGPCable
TBGPClockCable
TBGPLinkChip
TBGPICON
TBGPFanModule
TBGPJob
TBGPEthGateway
TBGPEGWMachineMap
TBGPPortBlockMap
TBGPBlockUsers
TBGPMidplaneSubnet
TBGPNodeSubnet
TBGPServiceAction
TBGPUserPrefs
TBGPReplacement_history
TBGPMachine_history
TBGPMidplane_history
TBGPNodeCard_history
TBGPNode_history
TBGPServiceCard_history
TBGPLinkCard_history
TBGPClockCard_history
TBGPLinkChip_history
TBGPIcon_history
TBGPFanModule_history
TBGPJob_history
TBGPServiceCardEnvironment
TBGPFanEnvironment
TBGPClockCardEnvironment
TBGPBULKPOWEREnvironment
TBGPNodeCardPOWEREnvironment
TBGPLinkCardPOWEREnvironment
TBGPSrvcCardPOWEREnvironment
TBGPLinkChipEnvironment
TBGPLinkCardEnvironment
TBGPNodeEnvironment
TBGPNodeCardEnvironment
TBGPEventLog
TBGPERRCodes
TBGPDiagRuns
TBGPDiagBlocks
TBGPDiagResults
TBGPDiagTests

BG/P Views
BGPMidplane
BGPMidplaneAll
BGPNodeCard
BGPNodeCardAll
BGPNode
BGPNodeAll
BGPServiceCard
BGPServiceCardAll
BGPLinkCard
BGPLinkCardAll
BGPClockCardAll
BGPBulkPowerSupplyAll
BGPLinkChip
BGPLinkChipAll
BGPFanModule
BGPFanModuleAll
BGPLink
BGPClockCardEnvironment
BGPDiagTests
BGPNodeCardCount
BGPLinkCardCount
BGPServiceCardCount
BGPNodeCount
BGPBasePartition
BGPBPBlockStatus
BGPSwitchLinks
BGPLinkBlockStatus
BGPSwitchPort
BGPPortBlockStatus
BGPBlockSize

BG/P DB2 Structure
DB2
Configuration Database
Operational Database
Environmental Database
RAS Database
Configuration database is the representation of all the hardware on the system
Operational database contains information and status for things that do not correspond directly to a single piece of hardware such as jobs, partitions, and history
Environmental database keeps current values for all of the hardware components on the system, such as fan speeds, temperatures, and voltages
RAS database collects hard errors, soft errors, machine checks, and software problems detected from the compute complex

BG/P DB2 Structure
DB2
Configuration Database
Configuration database is the representation of all the hardware on the system
  Machine
  Midplanes
  Service Cards
  Link Cards
  Link Chips
  Node Cards
  Processor Cards
  Compute & I/O Nodes
  Cables
  Clock Cards
  Fan Modules
Populated during initial system install and kept current during hardware service actions

BG/P DB2 Structure
DB2
Operational Database
Operational database contains information and status for things that do not correspond directly to a single piece of hardware such as jobs, partitions, and history
  Blocks (partitions)
  Jobs
  Job history
  Switch settings
  Link <-> Block map
  Block users
Maintained by the Blue Gene control system running on the service node

BG/P DB2 Structure
DB2
Environmental Database
Environmental database keeps current values for all of the hardware components on the system, such as fan speeds, temperatures, and voltages
  Fan Modules: Desired and actual fan speed, Voltages, Temperatures
  Service Cards: Ambient temp, Voltages
  Node Cards: Chip temps, Temp limits, Wiring faults
  Link Cards: Power, Status, Temps
Hardware Monitor reads and stores information at customizable intervals
By default, BG/P purges the data every 3 months (mmcs_envs_purge_months=3). The db.properties configuration can be altered to store more or less data as required by the local environment.
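The effect of a purge setting such as mmcs_envs_purge_months=3 can be sketched as follows. This is an illustration only, not the BG/P purge code: it uses Python's sqlite3 as a stand-in for DB2, and the `fan_env` table, its columns, and the function name are hypothetical.

```python
import sqlite3

def purge_environmentals(conn, purge_months, now):
    """Delete readings older than purge_months before now; return (cutoff, rows deleted)."""
    # Let the database do the month arithmetic so the cutoff matches its timestamps.
    cutoff = conn.execute(
        "SELECT datetime(?, ?)", (now, f"-{int(purge_months)} months")
    ).fetchone()[0]
    deleted = conn.execute("DELETE FROM fan_env WHERE time < ?", (cutoff,)).rowcount
    return cutoff, deleted

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fan_env (location TEXT, time TEXT, rpm INTEGER)")
conn.executemany(
    "INSERT INTO fan_env VALUES (?, ?, ?)",
    [
        ("R00-M0-A1", "2008-04-01 00:00:00", 9000),
        ("R00-M0-A1", "2008-05-14 23:59:59", 9100),
        ("R00-M0-A1", "2008-05-15 00:00:00", 9050),
        ("R00-M0-A1", "2008-08-01 12:00:00", 9200),
    ],
)

cutoff, deleted = purge_environmentals(conn, purge_months=3, now="2008-08-15 00:00:00")
remaining = conn.execute("SELECT COUNT(*) FROM fan_env").fetchone()[0]
print(cutoff, deleted, remaining)
```

Raising the month count keeps more history at the cost of a larger environmental database.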

BG/P DB2 Structure
DB2
RAS Database
RAS database collects hard errors, soft errors, machine checks, and software problems detected from the compute complex.
RAS events collected for bad hardware, missing cards, bad memory, bad cables
RAS events collected from compute complex while jobs are running, from kernel interrupts
RAS events generated by HW monitoring, for wiring faults, bad cards, fan speeds, over temps
RAS events generated by MMCS during link training, software errors, file system errors

Putting It All Together – Database Populate/Verification
Install team runs a Perl script (dbPopulate.pl) that populates the database with the expected configuration for the Blue Gene system.
The machine is powered on, and the InstallServiceAction program finds all hardware on the service network and verifies that the database matches the actual hardware configuration
This information is also modified and kept current during service actions (card replacement, recabling, etc.)
The VerifyCables program confirms that the Torus network cabling is correct, and VerifyIpAddresses confirms that the IO card IP addresses are correct
BGPMidplane
BGPCable
BGPServiceCard
BGPLinkCard
BGPNodeCard
BGPNode

Putting It All Together – Partitioning
Partitions are defined and the information is stored in DB2
Partitions can be defined using Navigator block builder, console commands like genblock, using Bridge API pm_add_partition, or dynamically created by an external scheduler or mpirun
BGPBlock
BGPBpBlockMap
BGPLinkBlockMap
BGPPortBlockMap
BGPSmallBlock

Putting It All Together – Booting
Partition information from the database is used to boot the hardware and prepare it for running jobs
Database contains the kernel images for the IO nodes and Compute nodes
Database contains all the switch settings needed to program the link chips in order to create the Torus or Mesh
Database relates block information to specific hardware
BGPBlock
BGPBpBlockMap
BGPLinkBlockMap
BGPPortBlockMap
BGPSmallBlock
BGPMidplane
BGPNodeCard
BGPLinkCard
BGPNode

Putting It All Together – Booting
Database prevents overlap by doing arbitration of nodes, switches, and cables
  This allows multiple partitions to be booted, provided they do not share the same nodes, switch ports, or cables
  They can, however, share the same switch, which allows for pass-through
  Any attempt to boot a partition that overlaps with an already booted partition will fail with a message that the hardware is already in use
BGPBlock
BGPBpBlockMap
BGPLinkBlockMap
BGPPortBlockMap
BGPSmallBlock
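The arbitration rule (no shared nodes, switch ports, or cables, but shared switches are fine) can be sketched in a few lines. This is a conceptual illustration in plain Python, not how the database implements it; the `Partition` class, the function names, and the resource identifiers are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Partition:
    name: str
    nodes: set = field(default_factory=set)
    switch_ports: set = field(default_factory=set)
    cables: set = field(default_factory=set)

def conflicts(candidate, booted):
    """Return the resources the candidate shares with an already booted partition."""
    shared = {}
    for kind in ("nodes", "switch_ports", "cables"):
        overlap = getattr(candidate, kind) & getattr(booted, kind)
        if overlap:
            shared[kind] = sorted(overlap)
    return shared

def try_boot(candidate, booted_partitions):
    """Boot the candidate unless it overlaps a booted partition."""
    for booted in booted_partitions:
        shared = conflicts(candidate, booted)
        if shared:
            raise RuntimeError(
                f"{candidate.name}: hardware already in use by {booted.name}: {shared}"
            )
    booted_partitions.append(candidate)

booted = []
# A and B pass through the same switch (R00-M0:X) but use different ports on it.
a = Partition("A", nodes={"R00-M0-N00"}, switch_ports={"R00-M0:X:S0", "R00-M0:X:S1"}, cables={"C1"})
b = Partition("B", nodes={"R00-M1-N00"}, switch_ports={"R00-M0:X:S2", "R00-M0:X:S3"}, cables={"C2"})
# C wants a node that A already holds.
c = Partition("C", nodes={"R00-M0-N00"}, cables={"C3"})

try_boot(a, booted)
try_boot(b, booted)
try:
    try_boot(c, booted)
    c_rejected = False
except RuntimeError as err:
    c_rejected = True
    print(err)
```

Tracking ports rather than whole switches is what allows pass-through: two partitions may traverse the same switch as long as each port belongs to only one of them.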

Putting It All Together – Job execution
Jobs are submitted to booted blocks
  Job submission is done via console, mpirun, submit, Bridge APIs, or external scheduler
  Submitter must either be block owner or block user
Control system polls hardware for RAS events during job execution and writes them into the RAS Event Log table. Each event is identified by the exact piece of hardware on which it occurred.
Control system polls for job completion and writes into the job history table the start and end time, number of nodes, exit status, etc.
BGPJob
BGPBlock
BGPBlockUsers
BGPEventlog
BGPJob_History
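The submission rule stated above (the block must be booted, and the submitter must be the block owner or a listed block user) amounts to a simple check. The sketch below is illustrative only; the dictionary layout and the function name are hypothetical and do not reflect the actual BGPBlock or BGPBlockUsers columns.

```python
def may_submit(submitter, block):
    """Return True if submitter may run a job on the given block."""
    if not block["booted"]:
        return False  # jobs are only submitted to booted blocks
    return submitter == block["owner"] or submitter in block["users"]

# Hypothetical block record combining owner and block-user information.
block = {"name": "R00-M0", "booted": True, "owner": "alice", "users": {"bob"}}

decisions = {user: may_submit(user, block) for user in ("alice", "bob", "carol")}
print(decisions)
```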

Navigator: Web Interface to DB
Browser interface to view DB2 data
Supports the viewing of RAS data, configuration data, diagnostics data, environmental data, and operational data
Can be used to see how the hardware fits together
Can be used to find trouble areas, hardware anomalies
Eliminates the need to have SQL expertise to view basic Blue Gene information from the database

Blue Gene Navigator (Job History)
Blue Gene Navigator (Blocks)
Blue Gene Navigator (RAS Event Log)
Blue Gene Navigator (Block Visualizer)

Summary
DB2 is the central repository of all control system information
Database information is not just passively recorded; rather, it is an integrated communications method for the control system
It has greatly enhanced our product:
  Writing reports and queries for job utilization
  Querying RAS events and error trends
  Building end user tools
  Training new people, faster learning curve
  Can test the control system without any real hardware
  Lower cost of ownership for customers with better tools and accessible data
Stability and performance have been excellent, so it has been one thing we have not had to spend a lot of time tuning or debugging

BG/P Security
Tom Budnik
August 2008

Security Admin Tool (bguser.pl)
Role / Capability
user
  Submit jobs through mpirun (HPC)
  Submit jobs using submit command (HTC)
  Read access to some data (job/block status) on Service node, through Navigator
  Access to the Front End nodes
  Complete access to compilers/tool chain/etc. for development on the Front End nodes
developer
  Submit jobs through mpirun (HPC)
  Submit jobs using submit command (HTC)
  Read access to some data (job/block status) on Service node, through Navigator
  Controlled and limited access to Service node - requires user ID on SN
  No root access but has elevated privileges
  Complete access to compilers/tool chain/etc. for development on the Front End nodes
  Debugging with coreprocessor
admin
  Complete access to Blue Gene/P functions on the Service Node and Front End Node(s)
Service (IBM)
  Access to required debug tools, system logs, and read access to database
The Security Administration tool assigns authority consistently to users who access the Blue Gene system. The tool authorizes users to various predetermined functions on the system by adding their profile to a selected group. The groups are: bgpuser, bgpdeveloper, bgpservice, and bgpadmin
The groups are created on the Service Node when the Blue Gene/P system is installed
  Can edit the program to define the groups differently
Existing Linux users are added to groups by running the bguser.pl utility:
./bguser.pl [options]
options are:
--user userName
--role [user/developer/service/admin]
Service Node
Groups
db2rasdb
db2iadm1 (DB2 client)
db2fadm1 (DB2 client)
db2asgrp (DB2 client)
Users
bgpsysdb (DB2 server instance)
bgpdb2c (DB2 client instance)
bgpadmin
mpirun
bgpuser
bgpdeveloper
bgpadmin
bgpservice

Front End Node
Groups: bgpadmin, bgpservice, bgpdeveloper, bgpuser
Users: mpirun
Profile: /etc/profile.d/bgp.sh

Database Access on Service Node
The db.properties file contains the information required to access the database. It is typically located in /bgsys/local/etc
Keywords of interest:
database_name=bgdb0
database_user=bgpsysdb
database_password=thepassword
database_schema_name=bgpsysdb
system=BGP
min_pool_connections=1
Access to the Blue Gene DB from the FEN is discouraged for security reasons
  This is the reason for having a front-end and a back-end mpirun
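A tool that needs these settings could read them with a small parser like the one below. This is a sketch under the assumption that db.properties is a plain key=value file with optional comment lines; the function name is hypothetical and the sample text simply repeats the keywords listed above.

```python
def load_properties(text):
    """Parse simple key=value lines, skipping blanks and comment lines."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", "!")):
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props

sample = """\
# sample contents, values taken from the slide
database_name=bgdb0
database_user=bgpsysdb
database_password=thepassword
database_schema_name=bgpsysdb
system=BGP
min_pool_connections=1
"""

props = load_properties(sample)
min_pool = int(props["min_pool_connections"])
print(props["database_name"], props["database_schema_name"], min_pool)
```

Because the file holds the database password, it should be readable only by the accounts that need it.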

Navigator Security Authority
There are three roles defined in Navigator: End User, Service, and Administrator. The Linux group to Navigator role mapping is defined in the local Navigator configuration file (/bgsys/local/etc/navigator-config.xml). Note that the value can be a Linux group name or gid.
Administrator groups
  Users in these Linux groups have access to all the sections in Navigator. To have multiple groups, add an <administrator-group>groupname</administrator-group> for each group.
Service groups
  Users in these Linux groups have access to the service sections in Navigator. To have multiple groups, add a <service-group>groupname</service-group> for each group.
User groups
  Users in these Linux groups have access to only the end-user pages in the Navigator. These pages do not allow any updates to the Blue Gene system. To have multiple groups, add a <user-group>groupname</user-group> for each group.
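The group-to-role mapping can be sketched as a lookup over the three element types named above. This is an illustration, not Navigator's code: the `<navigator-config>` root element, the sample group values, and the rule that the most privileged matching role wins are all assumptions.

```python
import xml.etree.ElementTree as ET

# Checked in order, so the most privileged matching role is returned (assumption).
ROLE_TAGS = (
    ("administrator-group", "Administrator"),
    ("service-group", "Service"),
    ("user-group", "End User"),
)

def navigator_role(config_xml, user_groups):
    """Map a user's Linux groups (names and/or gids as strings) to a Navigator role."""
    root = ET.fromstring(config_xml)
    for tag, role in ROLE_TAGS:
        for element in root.iter(tag):
            if (element.text or "").strip() in user_groups:
                return role
    return None

# Hypothetical configuration; the root element name is a guess.
sample_config = """
<navigator-config>
  <administrator-group>bgpadmin</administrator-group>
  <service-group>bgpservice</service-group>
  <service-group>1005</service-group>
  <user-group>bgpuser</user-group>
  <user-group>bgpdeveloper</user-group>
</navigator-config>
"""

roles = {
    "admin": navigator_role(sample_config, {"bgpadmin", "bgpuser"}),
    "by_gid": navigator_role(sample_config, {"1005"}),
    "developer": navigator_role(sample_config, {"bgpdeveloper"}),
    "outsider": navigator_role(sample_config, {"staff"}),
}
print(roles)
```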

Navigator Security - continued
Navigator runs with the profile of the user that starts bgpmaster (typically bgpadmin)
  The user that starts bgpmaster needs read access to the /etc/shadow file to allow Navigator to perform authentications
Navigator setup script
  The DB2 libraries must be made available to Tomcat so that it can access the database, and the Java Authentication and Authorization Service (JAAS) plug-in for interfacing to Linux Pluggable Authentication Modules (PAM) must be available so that Tomcat can authenticate users. A script is provided to do the setup:
  $ cd /bgsys/drivers/ppcfloor/navigator
  $ ./setup-tomcat.sh
Set up anonymous access to end-user pages
  By default, the previous setup configures the Navigator to allow only authenticated users to access the Web interface. To enable anonymous access to end-user pages, you need to copy the web-withenduser.xml file into Navigator’s configuration:
  $ cd /bgsys/drivers/ppcfloor/navigator
  $ cp web-withenduser.xml catalina_base/webapps/BlueGeneNavigator/WEB-INF/web.xml

Navigator Security - continued
PAM authentication:
  Navigator uses the bluegene PAM stack to authenticate users. This is set up by creating a file /etc/pam.d/bluegene:
#%PAM-1.0
auth include common-auth
auth required pam_nologin.so
account include common-account
password include common-password
session include common-session
SSL Configuration for Navigator Tomcat HTTP server instance:
  By default the Tomcat instance requires that Secure Sockets Layer (SSL) be configured, and the server listens on port 32072
  By default Navigator uses the /bgsys/local/etc/navigator.keystore keystore. This must be created with the keytool command when the system is configured.

mpirun Security
End user IDs are not required to exist on the service node
Requires mpirun_be to run under bgpadmin
User’s uid and gid are collected from the front-end and propagated to CIOD
  Used for file system access permissions

mpirun Security - continued
Challenge protocol
  A challenge/response protocol is used to authenticate the mpirun client when connecting to the mpirun daemon on the service node
  Authentication uses the OpenSSL Secure Hash Algorithm (SHA-2) and a shared secret
  The protocol uses the shared secret to create a hash of a random number, thereby verifying that the mpirun front end node has access to the secret
  To protect the secret, it is stored in a configuration file accessible only by the bgpadmin user on the service node and by a special mpirun user on the front end node
  The front end mpirun binary has its setuid flag enabled so it can change its uid to match the mpirun user and read the config file to access the secret
mpirun.cfg file
  The mpirun.cfg file contains the shared secret used by the mpirun daemon
  The file needs to exist on the SN and any FENs that submit jobs using mpirun
  The files need to match exactly for authentication to work
  The mpirun.cfg file is in the /bgsys/local/etc/ directory
  Only bgpadmin and the special mpirun user need to have access to the file
mpirun.cfg file example:
CHALLENGE_PASSWORD=BGPmpirunPasswd
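The general shape of such a challenge/response exchange can be sketched as follows. This is not the mpirun wire protocol, which the slides do not describe: the function names, the 16-byte challenge, and the choice of SHA-256 over the secret concatenated with the challenge are assumptions made for illustration.

```python
import hashlib
import hmac
import secrets

def parse_secret(cfg_text):
    """Extract the shared secret from mpirun.cfg-style text."""
    for line in cfg_text.splitlines():
        key, sep, value = line.strip().partition("=")
        if sep and key == "CHALLENGE_PASSWORD":
            return value.encode()
    raise KeyError("CHALLENGE_PASSWORD not found")

def make_challenge():
    """Daemon side: pick a fresh random number for each connection."""
    return secrets.token_bytes(16)

def respond(secret, challenge):
    """Client side: hash the challenge together with the shared secret."""
    return hashlib.sha256(secret + challenge).hexdigest()

def verify(secret, challenge, response):
    """Daemon side: recompute the hash and compare in constant time."""
    return hmac.compare_digest(respond(secret, challenge), response)

daemon_secret = parse_secret("CHALLENGE_PASSWORD=BGPmpirunPasswd")
client_secret = parse_secret("CHALLENGE_PASSWORD=BGPmpirunPasswd")

challenge = make_challenge()
accepted = verify(daemon_secret, challenge, respond(client_secret, challenge))
rejected = verify(daemon_secret, challenge, respond(b"wrongPasswd", challenge))
print(accepted, rejected)
```

The secret itself never crosses the network, and a fresh challenge per connection prevents a captured response from being replayed. This is also why the files on the SN and the FENs must match exactly: any difference in the secret changes the hash.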