© 2008 IBM Corporation
Blue Heron Project
IBM Rochester: Tom Budnik: tbudnik@us.ibm.com, Amanda Peters: apeters@us.ibm.com
Condor: Greg Thain
With contributions from: IBM Rochester: Mark Megerian, Sam Miller, Brant Knudson and Mike Mundy Other IBMers: Patrick Carey, Abbas Farazdel, Maria Iordache and Alex Zekulin UW-Madison Condor: Dr. Miron Livny
April 30, 2008
Agenda
What is the Blue Heron Project?
Condor and IBM Blue Gene Collaboration
Introduction to Blue Gene/P
What applications fit the Blue Heron model?
How does Blue Heron work?
Information Sources
Condor on BG/P demo (Greg Thain)
What is the Blue Heron Project?
Blue Gene Environment – Paths Toward a General Purpose Machine:
• Serial and Pleasantly Parallel Apps → HTC (*** NEW *** available 5/16/08)
• Highly Scalable Msg Passing Apps → HPC (MPI)

Blue Heron = Blue Gene/P HTC and Condor
Blue Heron provides a complete integrated solution that gives users a simple, flexible mechanism for submitting single-node jobs.
Blue Gene looks like a "cluster" from an app’s point of view
Blue Gene supports hybrid application environment
Classic HPC (MPI) apps and now HTC apps
Condor and Blue Gene Collaboration

Both IBM and Condor teams engaged in adapting code to bring Condor and Blue Gene technologies together

Previous Activities (BG/L)
• Prototype/research Condor running HTC workloads

Current Activities (BG/P)
• Blue Heron Project: partner in design of HTC services
• Condor supports HTC workloads using static partitions

Future Collaboration (BG/P and BG/Q)
• Condor supports dynamic machine partitioning
• Condor supports HPC (MPI) jobs
• I/O Node exploitation with Condor
• Persistent memory support (data affinity scheduling)
• Petascale environment issues
Introduction to Blue Gene: Technology Roadmap

• Blue Gene/L (2004): PPC 440 @ 700MHz, scalable to 596+ TF
• Blue Gene/P (2007): PPC 450 @ 850MHz, scalable to 3+ PF
• Blue Gene/Q

BG/P is the 2nd Generation of the Blue Gene Family
Introduction to Blue Gene/P

• Chip: Quad-Core PowerPC System-on-Chip; 13.6 GF/s; 8 MB EDRAM; 4 processors
• Compute Card: 1 chip, 20 DRAMs; 13.6 GF/s; 2 or 4 GB DDR2
• Node Card: 32 Compute Cards, up to 2 I/O cards; 435 GF/s; 64 or 128 GB
• Cabled Rack: 32 Node Cards, up to 64x10 GigE I/O links; 14 TF/s; 2 or 4 TB
• System: up to 256 racks; up to 3.56 PF/s; 512 or 1024 TB

• Leadership performance in a space-saving, processor-dense, power-efficient package.
• High reliability: designed for less than 1 failure per rack per year (7 days MTBF for 72 racks).
• Easy administration using the powerful web-based Blue Gene Navigator.
• Ultrascale capacity machine ("cluster buster"): run 4,096 HTC jobs on a single rack.
• The system scales from 1 to 256 racks: 3.56 PF/s peak.
What applications fit the Blue Heron model? Master/Worker Paradigm

Many "pleasantly parallel" apps on BG/P use a compute node as the "master node".

Advantage of the Blue Heron (HTC) solution: move the "master node" from a Blue Gene compute node to the Front-End Node (FEN). This is a better solution for the following reasons:
• Application resiliency: in the MPI model a single node failure kills the entire app for the partition. In HTC mode only the job running on the failed node is ended; other single-node jobs continue to run on the partition.
• The FEN has more memory, better performance, and more functionality than a single compute node.
• Code that runs on the compute nodes is much cleaner, since it only contains the work to be performed and leaves the coordination to a script or scheduler (NO MPI NEEDED).
• The coordinator functionality can be a Perl script, Python, a compiled program, or anything that runs on Linux.
• The coordinator can interact directly with DB2 or MySQL, either to get the inputs for the application or to store the results. This can eliminate the need to create a flat-file input for the app, or to generate the results in an output file.

Example: American Monte Carlo (options pricing). Reference: en.wikipedia.org/wiki/Monte_Carlo_methods_in_finance

In the MPI master/worker model, the coordinator is rank 0 on a compute node:
#include <mpi.h>

int main(int argc, char *argv[]) {
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {
        // send work to other nodes and collect results
    } else {
        // do real work
    }
    MPI_Finalize();
    return 0;
}
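In the Blue Heron model, the rank-0 coordinator above moves off the compute nodes entirely. A minimal Python sketch of the coordinator side, under stated assumptions: the `run_task` payload and the in-process thread pool are stand-ins for illustration; on a real system each task would be launched as its own single-node job via `submit` or Condor.

```python
# Sketch of the HTC master/worker split: the coordinator runs on the
# Front-End Node and each task is an independent single-node job.
# A thread pool stands in for the Blue Gene partition here.
from concurrent.futures import ThreadPoolExecutor

def run_task(seed):
    # Placeholder for the real per-node computation (no MPI needed):
    # each task depends only on its own input.
    return seed * seed

def coordinate(inputs):
    # The coordinator fans tasks out, then merges the results --
    # e.g. keeping the minimum-energy model, as Rosetta++ does.
    with ThreadPoolExecutor(max_workers=4) as pool:
        results = list(pool.map(run_task, inputs))
    return min(results)

print(coordinate([3, 1, 4, 1, 5]))  # prints 1
```

Because no task talks to any other, a failed task costs only that one result, which is exactly the resiliency argument made above.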
How does Blue Heron work? "Software Architecture Viewpoint"

Design Goals:
• Lightweight
• Extreme scalability
• Flexible scalability
• High throughput (fast)
How does Blue Heron work? "End user perspective"

"submit" client: acts as a shadow or proxy for the real job running on the compute node – very lightweight.

Submitting jobs (typically from the FEN), to either a location or a pool:
• location: the resource where the job will execute, in the form of a processor or wildcard location
• pool id: a scheduler alias for a collection of partitions available to run a job on

Example #1 (submit to location): submit -location "R00-M0-N00-J05-C00" -exe hello_world
Example #2 (submit to pool): submit -pool BIOLOGY -exe hello_world

Job scheduler example: submit jobs using Condor ("condor_submit")
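The two submit examples can also be built programmatically from a script on the FEN. A hedged sketch: the flag names match the submit options listed at the end of this deck, but the wrapper function itself is hypothetical.

```python
# Hypothetical wrapper around the HTC `submit` client: builds the
# argument vector for a location- or pool-targeted single-node job.
def build_submit_cmd(exe, location=None, pool=None, args=None):
    cmd = ["submit"]
    if location:
        cmd += ["-location", location]   # e.g. "R00-M0-N00-J05-C00"
    if pool:
        cmd += ["-pool", pool]           # e.g. "BIOLOGY"
    cmd += ["-exe", exe]
    if args:
        # submit expects the argument list as one double-quoted string
        cmd += ["-args", " ".join(args)]
    return cmd

print(build_submit_cmd("hello_world", location="R00-M0-N00-J05-C00"))
print(build_submit_cmd("hello_world", pool="BIOLOGY"))
```

On a real system the returned list would be handed to something like `subprocess.run`; here it just mirrors Examples #1 and #2 above.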
Information Sources
Official Website www.ibm.com/servers/deepcomputing/bluegene.html
Blue Gene Redbooks and Redpapers: for the latest list go to www.redbooks.ibm.com and search for "Blue Gene"
IBM Journal of Research and Development researchweb.watson.ibm.com/journal/rd/521/team.html
www.research.ibm.com/journal/rd49-23.html
Research Site www.research.ibm.com/bluegene/index.html
TOP500 List www.top500.org
Green500 List www.green500.org
Condor using HTC on BG/P Demo: Rosetta++ with MySQL
Rosetta++ is a protein prediction algorithm
It is very well suited to HTC, since it runs many simulations of the same protein using different random-number seeds. The one that results in the lowest-energy model among those attempted is the "solution".
Rosetta++ had already been shown to work on Blue Gene, by David Baker’s lab
Our goal was to show that it runs well in HTC mode
Very few actual code changes were required: compiled for Blue Gene, but using the single-node version (NO MPI)
Changed a few places that did file output to use stdout, since that made it easier for the submitting script to associate each task to its results
Created a simple database front-end using both DB2 and MySQL, to contain the proteins and the seeds
Perl script reads inputs from database, submits each task to Condor, and processes results back into the database
Demonstrates HTC mode using Condor, with perfect linear scaling and no MPI
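The demo's database-driven loop can be sketched in Python. Assumptions for illustration: `sqlite3` stands in for the DB2/MySQL front-end used in the demo, and `fake_energy` is a deterministic placeholder for one Rosetta++ simulation.

```python
# Read (protein, seed) inputs from a database, run each independent
# task, and store results back -- no flat-file input or output needed.
import sqlite3

def fake_energy(protein, seed):
    # Placeholder for one Rosetta++ run; the real demo launched the
    # single-node binary through Condor and collected its stdout.
    return ((seed * 37) % 100) / 10.0

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE tasks (protein TEXT, seed INTEGER)")
db.executemany("INSERT INTO tasks VALUES (?, ?)",
               [("1abc", s) for s in range(5)])
db.execute("CREATE TABLE results (protein TEXT, seed INTEGER, energy REAL)")

for protein, seed in db.execute("SELECT protein, seed FROM tasks").fetchall():
    db.execute("INSERT INTO results VALUES (?, ?, ?)",
               (protein, seed, fake_energy(protein, seed)))

# The lowest-energy model among those attempted is the "solution".
best_seed, best_energy = db.execute(
    "SELECT seed, MIN(energy) FROM results").fetchone()
print(best_seed, best_energy)
```

Keeping inputs and results in one database is what lets a submitting script associate each task with its result, as noted above for the stdout change.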
What are the Blue Gene System Components?
• Blue Gene Rack(s): Hardware/Software
• Host System: Service Node and Front End (login) Nodes; SuSE SLES/10, HPC SW Stack, File Servers, Storage Subsystem, XLF/C Compilers, DB2
• 3rd Party Ethernet Switch
Blue Gene Integrated Networks

• Torus: compute nodes only; direct access by app; DMA
• Collective: compute and I/O nodes attached; 16 routes allow multiple network configurations to be formed; contains an ALU for collective operation offload; direct access by app
• Barrier: compute and I/O nodes; low-latency barrier across system (< 1 usec for 72 racks); used to synchronize time bases; direct access by app
• 10Gb Functional Ethernet: I/O nodes only
• 1Gb Private Control Ethernet: provides JTAG, i2c, etc. access to hardware; accessible only from Service Node
• Clock network: single clock source for all racks
Blue Gene is the most Power-, Space-, and Cooling-Efficient Supercomputer (published specs per peak performance)

[Bar chart, 0%–400% scale, comparing Racks/TF, kW/TF, Sq Ft/TF, and Tons/TF for Sun/Constellation, Cray/XT4, and SGI/ICE against IBM BG/P]
Blue Gene is Orders of Magnitude more Reliable than other Platforms

Results of a survey conducted by Argonne National Lab on 10 clusters ranging from 1.2 to 365 TFlops (peak); excludes storage subsystem, management nodes, SAN network equipment, and software outages.

[Bar chart: failures per month for a 100 TF/s system, for Itanium2, x86, Power5, BG/L, and BG/P; bar values span 800, 394, 127, and 1 failures per month down to <1* for BG/P]

* Estimated based on reliability improvements implemented in BG/P compared to BG/L
Blue Gene Software Hierarchical Organization
Compute nodes dedicated to running user application, and almost nothing else - simple compute node kernel (CNK)
I/O nodes run Linux and provide a more complete range of OS services – files, sockets, process launch, signaling, debugging, and termination
Service node performs system management services (e.g., heartbeating, monitoring errors) – transparent to application software
BG/P Job Modes allow Flexible use of Compute Node Resources

• Quad Mode (also called Virtual Node Mode): all 4 cores run 1 process each; no threading; each process gets 1/4 of node memory; MPI/HTC programming model
• Dual Mode: 2 cores run 1 process each; each process may spawn 1 thread on a core not used by the other process; each process gets 1/2 of node memory; MPI/OpenMP/HTC programming model
• SMP Mode: 1 core runs 1 process; the process may spawn threads on each of the other cores; the process gets the full node memory; MPI/OpenMP/HTC programming model

[Diagram: memory address space and process/thread placement across Core 0–Core 3 for each mode]
Why and for What is Blue Gene Used?

• Improve understanding – significantly larger scale, more complex and higher resolution models; new science applications
• Multiscale and multiphysics – from atoms to mega-structures; coupled applications
• Shorter time to solution – answers from months to minutes

Application areas:
• Physics – Materials Science, Molecular Dynamics
• Environment and Climate Modeling
• Life Sciences: Sequencing
• Biological Modeling – Brain Science
• Computational Fluid Dynamics
• Life Sciences: In-Silico Trials, Drug Discovery
• Financial Modeling, Streaming Data Analysis
• Geophysical Data Processing, Upstream Petroleum
Many Computational Science Modeling and Simulation Algorithms and Numerical Methods are Massively Parallel

[Chart mapping application areas (rated Good / Better / Best fit for massive parallelism) onto basic algorithms and numerical methods. Source: Rick Stevens, Argonne National Lab and The University of Chicago]

Basic algorithms and numerical methods: Monte Carlo, Discrete Events, N-Body, Fourier Methods, Graph Theoretic, Transport, Partial Diff. EQs., Ordinary Diff. EQs., Fields, Symbolic Processing, Pattern Matching, Raster Graphics

Application areas include: Pipeline Flows, Biosphere/Geosphere, Neural Networks, Condensed Matter Electronic Structure, Cloud Physics, Chemical Reactors, CVD, Petroleum Reservoirs, Molecular Modeling, Biomolecular Dynamics / Protein Folding, Rational Drug Design, Nanotechnology, Fracture Mechanics, Chemical Dynamics, Atomic Scatterings, Electronic Structure, Flows in Porous Media, Fluid Dynamics, Reaction-Diffusion, Multiphase Flow, Weather and Climate, Structural Mechanics, Seismic Processing, Aerodynamics, Geophysical Fluids, Quantum Chemistry, Actinide Chemistry, Cosmology, Astrophysics, VLSI Design, Manufacturing Systems, Military Logistics, Neutron Transport, Nuclear Structure, Quantum Chromodynamics, Virtual Reality, Virtual Prototypes, Computational Steering, Scientific Visualization, Multimedia Collaboration Tools, CAD, Genome Processing, Databases, Large-scale Data Mining, Intelligent Agents, Intelligent Search, Cryptography, Number Theory, Ecosystems, Economics Models, Signal Processing, Data Assimilation, Diffraction & Inversion Problems, MRI Imaging, Distribution Networks, Electrical Grids, Phylogenetic Trees, Crystallography, Tomographic Reconstruction, Plasma Processing, Radiation, Multibody Dynamics, Air Traffic Control, Population Genetics, Transportation Systems, Economics, Computer Vision, Automated Deduction, Computer Algebra, Orbital Mechanics, Electromagnetics, Magnet Design
What applications fit the Blue Heron model? A wide range of applications can run in HTC mode

Many applications that run on Blue Gene today are "embarrassingly (pleasantly) parallel" or "independently parallel": they don't exploit the torus for MPI communication and just want a large number of small tasks, with a coordinator of results.

HTC Application Identification

Solution statement: a high-throughput computing (HTC) application is one in which the same basic calculation must be performed over many independent input data elements and the results collected. Because each calculation is independent, it is extremely easy to spread calculations out over multiple cluster nodes. For this reason, high-throughput applications are sometimes called "embarrassingly parallel." HTC applications occur much more frequently than one might think, showing up in areas such as parameter studies, search applications, data analytics, and what-if calculations.

Identifying an HTC application: there are a number of identifiers you can use to determine whether your specific computing problem fits into the category of a high-throughput application:
• Do you need to run many instances of the same application with different arguments or parameters?
• Do you need to run the same application many times with different input files?
• Do you have an application that can select subsets of the input data and whose results can be combined by a simple merge process, such as concatenating them, placing them into a single database, or adding them together?

If the answer to any of these questions is "yes," then it is quite likely that you have an HTC application.
Source: Grid.org
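The "simple merge process" named in the last question is the only coordination an HTC app needs. A small illustrative sketch (the word-count payload is invented here) of combining independent partial results by adding them together:

```python
# Merge independent partial results by summation -- the kind of
# simple reduction that marks a problem as HTC-friendly.
def merge_counts(partials):
    total = {}
    for part in partials:
        # Each part came from one independent task; order is irrelevant.
        for key, count in part.items():
            total[key] = total.get(key, 0) + count
    return total

print(merge_counts([{"a": 1}, {"a": 2, "b": 3}]))  # {'a': 3, 'b': 3}
```

If the merge step is this cheap relative to the per-task work, splitting across thousands of single-node jobs costs almost nothing.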
How does Blue Heron work?
Key Features:
Provides a job submit command that is simple, lightweight, and extremely fast
Job state is integrated into Control System database, so administrators know which nodes have jobs, and which are idle
Provides stdin/stdout/stderr on a per-job basis
Enables individual jobs to be signaled or killed
Maintains a user ID on per-job basis (allows multiple users per partition)
Blue Gene Navigator shows HTC jobs (active or in history) with job exit status & runtime stats
Designed for easy integration with job schedulers (e.g. Condor, LoadLeveler, SIMPLE, etc.)
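Since Condor is the scheduler demoed in this deck, here is a hedged sketch of generating a Condor submit description for a batch of independent single-node jobs. The attributes shown are standard Condor submit-file keywords; the Blue Gene-specific settings used in the actual demo are not reproduced here.

```python
# Build a minimal Condor submit description: one executable queued
# n_jobs times, with per-process output files via $(Process).
def make_submit_description(exe, n_jobs):
    return "\n".join([
        f"executable = {exe}",
        "output = out.$(Process)",
        "error = err.$(Process)",
        "log = jobs.log",
        f"queue {n_jobs}",
    ])

print(make_submit_description("hello_world", 4))
```

The resulting text would be written to a file and handed to condor_submit; each queued process becomes one independent HTC job.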
submit command
./submit [options] or ./submit [options] binary [arg1 arg2 ... argn]
Job options:
[-]-exe <exe> executable to run
[-]-args "arg1 arg2 ... argn" arguments, must be enclosed in double quotes
[-]-env <env=value> add an environment variable for the job
[-]-exp_env <env> export an environment variable to the job's environment
[-]-env_all add all current environment variables to the job's environment
[-]-cwd <cwd> the job's current working directory
[-]-timeout <seconds> number of seconds before the job is killed
[-]-strace run job under system call tracing
Resource options:
[-]-mode <SMP|DUAL|VNM> the job mode
[-]-location <Rxx-Mx-Nxx-Jxx-Cxx> compute core location to run the job
[-]-pool <id> compute node pool ID to run the job
Options:
[-]-port <port> listen port of the submit mux to connect to (default 10246)
[-]-trace <0-7> tracing level, default(6)
[-]-enable_tty_reporting disable the default line buffering of stdin, stdout, and stderr when input (stdin) or output (stdout/stderr) is not a tty
[-]-raise if a job dies with a signal, submit will raise this signal