-
NNSA Advanced Simulation and Computing: Past, Present, Future
Steve Louis
Integrated Computing and Communications Department
Lawrence Livermore National Laboratory, 7000 East Avenue, Livermore, CA 94550-9234
TEL: 1-925-422-1550   FAX: 1-925-423-8715   E-mail: stlouis@llnl.gov
Presented at the March 9-10, 2004 THIC Meeting, Sony Auditorium, 3300 Zanker Rd, San Jose CA 95134-1940
UCRL-PRES-202571
This work was performed under the auspices of the U.S. Department of Energy by University of California Lawrence Livermore National Laboratory under contract No. W-7405-Eng-48.
-
THIC Meeting, San Jose CA, 10-Mar-2004 2
Outline
• The NNSA Stockpile Stewardship Program
• Where we are now: ASCI Red, Blue, White, Q
• Challenges for today: Linux, Red Storm, Purple
• The future challenges: BlueGene/L and beyond
• A short digression: Storage and file systems
-
THIC Meeting, San Jose CA, 10-Mar-2004 3
The NNSA’s Stockpile Stewardship Program (SSP)
• In October 1992, underground nuclear testing ceased by Presidential decree
• In August 1995, the President announced U.S. intention to pursue stockpile stewardship in the absence of nuclear testing
• The NNSA SSP is tasked with ensuring the safety, security, and reliability of the nation’s stockpile
• Leading-edge physics simulation capabilities are key to assessment and certification requirements
-
THIC Meeting, San Jose CA, 10-Mar-2004 4
Simulation plays central role to maintain stockpile confidence

But its value is critically dependent on the other elements of the integrated program
-
THIC Meeting, San Jose CA, 10-Mar-2004 5
Role of Advanced Simulation and Computing Program (ASC)
• ASC Mission: Provide computational means to assess and certify the safety, performance, and reliability of the nuclear stockpile and its components
• ASC Goals: Deliver predictive codes based on multi-scale modeling, code verification and validation, small-scale experimental data, test data, engineering analysis, and expert judgment
• ASC started in 1996 (as ASCI): approximately 1/8 of the total Stockpile Stewardship Program budget
-
THIC Meeting, San Jose CA, 10-Mar-2004 6
Confluence of events leads to creation of ASCI in 1996
• Inexpensive commodity “killer micros” emerged with scalar speeds equal to or exceeding those of custom processors
• Parallel computing technology reached the point where it could begin to support large multi-physics simulations
• The decision to halt underground nuclear testing generated a requirement for much higher fidelity simulations
• Reducing numerical errors so that they no longer mask inadequacies in the physical models requires ~100 teraFLOPS (initial 2004 target for ASCI)
-
THIC Meeting, San Jose CA, 10-Mar-2004 7
ASC Program is more than platforms and physics codes
[Program elements diagram: Advanced Applications; Materials and Physics Modeling; Integration; Simulation Support; Physical Infrastructure and Platforms; Problem Solving Environment; University Partnerships; Advanced Architectures; Computational Systems; Verification and Validation; VIEWS; PathForward; DISCOM]
-
THIC Meeting, San Jose CA, 10-Mar-2004 8
Outline
• The NNSA Stockpile Stewardship Program
• Where we are now: ASCI Red, Blue, White, Q
• Challenges for today: Linux, Red Storm, Purple
• The future challenges: BlueGene/L and beyond
• A short digression: Storage and file systems
-
THIC Meeting, San Jose CA, 10-Mar-2004 9
ASCI Red, Blue, White, Q

System | Host | Install Date | Vendor | # of Processors | Peak Floating Point (TFLOPs) | Processor Type | Footprint (sq ft) | Power (MW) | Construction Issues
Red (upgraded ’99) | SNL | 1997 | Intel | 9,460 | 3.15 | Pentium II 333 MHz | 2,500 | 0.85 | Use existing
Blue Pacific | LLNL | 1998 | IBM | 5,808 | 3.89 | PowerPC 604e | 5,100 | 0.6 | Use existing
Blue Mountain | LANL | 1998 | SGI | 6,144 | 3.072 | MIPS R10000 | 12,000 | 1.6 | Use existing
White | LLNL | 2000 | IBM | 8,192 | 12.3 | Power3-II | 10,000 | 1.0 | Expand existing
Q | LANL | 2002 | HP Compaq | 8,192 | 20.5 | Alpha EV-68 | 14,000 | 1.9 | New building

More information at http://www.llnl.gov/asci/platforms/platforms.html
-
THIC Meeting, San Jose CA, 10-Mar-2004 10
ASCI Red at SNL
• 4,576 nodes and 9,472 Pentium II processors, 400 MB/sec interconnect, 3.21 TF (after upgrades)
• 3D mesh (38x32x2) interconnect with separate partitions for service, I/O, computing nodes - red/black switching
• OSF1 runs on service and I/O nodes, Puma/Cougar light-weight kernel runs on compute nodes
[Diagram: Red partitioned into service, compute, file I/O, and network I/O nodes, with users and /home attached to the service partition]
-
THIC Meeting, San Jose CA, 10-Mar-2004 11
ASCI Blue Pacific at LLNL
System Parameters: 3.89 TFLOP/s peak, 2.6 TB memory, 62.5 TB global disk
Each SP sector comprised of 488 Silver nodes and 24 HPGN links
Sector Y: 1.5 GB/node memory, 20.5 TB global disk, 4.4 TB local disk
Sector S: 2.5 GB/node memory, 24.5 TB global disk, 8.3 TB local disk
Sector K: 1.5 GB/node memory, 20.5 TB global disk, 4.4 TB local disk
[Diagram: three SP sectors interconnected by HPGN links, with GbE and FDDI external connections]
-
THIC Meeting, San Jose CA, 10-Mar-2004 12
ASCI White IBM Nighthawk-2 Compute Node Specification

CPUs per node: 16
CPU clock speed: 375 MHz
Node peak perf.: ~24 GigaOP/s
Memory per node: 16 GB
Local disk per node: 72 GB
POWER3 processors are super-scalar pipelined 64-bit RISC chips with two floating-point units and three integer units. They are capable of executing up to eight instructions per clock cycle and up to four floating-point operations per cycle.
-
THIC Meeting, San Jose CA, 10-Mar-2004 13
ASCI Q at LANL is Alpha ES45 SMP with Quadrics interconnect

• Alpha 21264 EV-68 processor
• AlphaServer ES45 SMP: 4 processors/SMP, 8/16/32 GB memory/SMP
• Quadrics (QSW) dual-rail switch interconnect: fat-tree switch, high bandwidth (250 MB/s/rail), low latency (~5 µs)
• Switch-based Fibre-attached storage arrays: RAID5 sets, 72 GB drives
• AlphaServer SC and Tru64 Unix based
-
THIC Meeting, San Jose CA, 10-Mar-2004 14
Platforms are shaken out with groundbreaking science runs
The first million-atom simulation in biology: molecular mechanism of the genetic code. (Kevin Sanbonmatsu)
This work (on ASCI Q at LANL) will define a new state-of-the-art in bio-molecular simulation, paving the way for other large biological studies.
Conclusions
• This simulation is more than 5 times larger than the previous largest to date.
•Core of the ribosome is more stable than outer regions.
• Identified possible pivot point for ratchet motion during translocation.
-
THIC Meeting, San Jose CA, 10-Mar-2004 15
Outline
• The NNSA Stockpile Stewardship Program
• Where we are now: ASCI Red, Blue, White, Q
• Challenges for today: Linux, Red Storm, Purple
• The future challenges: BlueGene/L and beyond
• A short digression: Storage and file systems
-
THIC Meeting, San Jose CA, 10-Mar-2004 16
LLNL strategy is to straddle multiple technology waves

Three complementary curves…
1. Delivers to today’s demanding stockpile needs
   — Production environment
   — For “must have” deliverables
2. Delivers transitions for next generation
   — “Near production,” but riskier environment
   — Capacity/capability systems in a strategic mix
3. Delivers affordable path to petaFLOP/s computing
   — Research environment, leading transition to petaflop systems

[Chart: performance vs. time across successive technology curves — mainframes (RIP); vendor-integrated SMP clusters (IBM SP, HP SC); IA32/IA64/AMD + Linux; cell-based (IBM BG/L) — spanning today through FY05, with cost points of $10M/TF (White), $7M/TF (Q), $2M/TF (Purple C), $1.2M/TF (MCR), ~$500K/TF, and ~$170K/TF]

Any given technology curve is ultimately limited by Moore’s Law
-
THIC Meeting, San Jose CA, 10-Mar-2004 17
Ramifications of LLNL strategy

• Benefits
  — Maximizes cost performance and adapts quickly to change
  — Offers options to customers that match their requirements
• Costs
  — Requires expertise in multiple technologies
    – Simultaneously field systems on multiple technology curves
  — Requires constant attention to new technology
    – Must correctly assess the longevity, maturity (risk), and usability of technology
• Issues
  — Programming model/environment must be made as consistent as possible
-
THIC Meeting, San Jose CA, 10-Mar-2004 18
Maintain similar programming model across LLNL platforms

Idea: Provide a consistent programming model for multiple platform generations and across multiple vendors!
Idea: Incrementally increase functionality over time!

[Diagram: OpenMP within each node and MPI communications between nodes, with local disk plus shared serial (NFS) I/O evolving to local plus global shared scalable I/O]
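The programming model on this slide is the familiar hybrid of OpenMP threads within an SMP node and MPI messages between nodes. The sketch below is a minimal, illustrative example of that combination (not code from the ASC applications); the problem and function names are placeholders.

```c
/* Minimal hybrid MPI+OpenMP sketch: OpenMP threads work within each SMP
 * node while MPI combines partial results across nodes. Illustrative only. */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, nranks;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    const long n = 1000000;          /* points per rank */
    double local_sum = 0.0;

    /* OpenMP parallelism inside the node (one MPI rank per node assumed) */
    #pragma omp parallel for reduction(+:local_sum)
    for (long i = 0; i < n; i++) {
        double x = (rank * n + i) * 1.0e-6;
        local_sum += x * x;          /* stand-in for per-zone physics work */
    }

    /* MPI parallelism across nodes: combine the per-node results */
    double global_sum = 0.0;
    MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("global sum = %g (ranks=%d, threads/rank=%d)\n",
               global_sum, nranks, omp_get_max_threads());

    MPI_Finalize();
    return 0;
}
```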
-
THIC Meeting, San Jose CA, 10-Mar-2004 19
A supercomputing caveat: a new facility may be required

Nicholas Metropolis Center for Modeling and Simulation, LANL
Future Terascale Simulation Facility at LLNL
Future building for Red Storm at SNL
-
THIC Meeting, San Jose CA, 10-Mar-2004 20
The MCR Linux Cluster made 10+ teraFLOPS computing affordable

• 1,116 dual 2.4 GHz Pentium 4 (Prestonia) compute nodes with 4.0 GB PC2100 DDR SDRAM each
• Aggregate 11.1 TF/s peak, 4.608 TB memory
• 2 login nodes with 4 Gb-Enet, 2 service nodes
• 1,152-port QsNet Elan3 interconnect (12x96D32U + 4x96D32U), 100BaseT management/control network
• 32 gateway nodes @ 140 MB/s delivered Lustre I/O over 2x1GbE to a GbEnet federated switch
• 2 metadata (fail-over) servers (MDS)
• 64 Object Storage Targets (OSTs), 70 MB/s delivered each, Lustre total 4.48 GB/s
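The I/O numbers above balance by design: 64 OSTs at 70 MB/s each and 32 gateway nodes at 140 MB/s each both deliver 4.48 GB/s of Lustre bandwidth. A small illustrative check of that arithmetic (not taken from the slide itself):

```c
/* Back-of-the-envelope check of MCR's delivered Lustre bandwidth:
 * the OST side and the gateway side are provisioned to match. */
#include <stdio.h>

int main(void)
{
    int    osts       = 64;      /* Object Storage Targets     */
    double mb_per_ost = 70.0;    /* delivered MB/s per OST     */
    int    gateways   = 32;      /* Lustre gateway nodes       */
    double mb_per_gw  = 140.0;   /* delivered MB/s per gateway */

    double ost_side = osts * mb_per_ost / 1000.0;     /* GB/s */
    double gw_side  = gateways * mb_per_gw / 1000.0;  /* GB/s */

    printf("OST side:     %.2f GB/s\n", ost_side);    /* 4.48 GB/s */
    printf("Gateway side: %.2f GB/s\n", gw_side);     /* 4.48 GB/s */
    return 0;
}
```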
-
THIC Meeting, San Jose CA, 10-Mar-2004 21
LANL announced Lightning Linux Cluster in August 2003

• Theoretical peak speed of 11.26 teraFLOP/s
• Will be built and integrated by Linux NetworX
• Will be used for the smaller SSP calculations
• Initial delivery with 1,280 dual-processor nodes
• Option to extend machine to 1,408 nodes
• Uses AMD Opteron 64-bit processors and Linux
• Uses Myrinet 2000 Lanai XP interconnect

Yet another example of capacity computing at around $1 million or so per peak teraFLOP

More information on Lightning at http://www.lnxi.com/news/lightning_info.php
-
THIC Meeting, San Jose CA, 10-Mar-2004 22
Thunder is newest LLNL cluster: a 1000+ node quad Itanium2

Thunder at LLNL will be the world’s largest Linux cluster (www.llnl.gov/linux/thunder/)
• 23 TF (peak) procurement on condensed schedule to meet crushing demand
• Quad 1.4 GHz Itanium2 Tiger4 nodes
• 8.0 GB DDR266 SDRAM per node
• 400 MB/s transfers to archive over quad Jumbo Frame Gb-Enet and QSW links
• 75 TB in local, 192 TB global parallel disk
• Lustre file system with 6.4 GB/s delivered parallel I/O performance
• Expected to be operational and in use by early 2004
-
THIC Meeting, San Jose CA, 10-Mar-2004 23
Key Open Source Software for MCR and Thunder clusters

• Chaos
  — RedHat Linux distribution (www.redhat.com)
  — LinuxBIOS (www.linuxbios.org)
  — Chaos (www.llnl.gov/linux/chaos)
• Lustre cluster-wide file system
  — www.lustre.org
• Quadrics QsNet drivers
  — www.quadrics.com
• SLURM/DPCS/LCRM
  — http://www.llnl.gov/icc/lc/dpcs/dpcs_overview.html
  — http://www.llnl.gov/linux/slurm/slurm.html
-
THIC Meeting, San Jose CA, 10-Mar-2004 24
Linux cluster and open source observations

• 10 TF/s scale computing is now becoming affordable
  — LTO options for $10-15M over 2-4 yrs put it into the budget range of departmental supercomputing
• These technologies are somewhat disruptive
  — Important to factor Linux and commodity hardware into an overall computing strategy
• Significant opportunities for broad collaborations
  — Key open source cluster technologies under active development
-
Open Computing Facility (OCF) Clusters, Networks, Storage

[System diagram, BB/MKS Version 6, Dec 23, 2003; key recoverable elements:]
• MCR (B439): 1,114 dual P4 compute nodes, 1,152-port QsNet Elan3, 4 login nodes with 4 Gb-Enet, 2 MDS, 32 gateway nodes @ 190 MB/s delivered Lustre I/O over 2x1GbE
• ALC (B439): 924 dual P4 compute nodes, 960-port QsNet Elan3, 2 login nodes with 4 Gb-Enet, 2 MDS, 32 gateway nodes @ 190 MB/s delivered Lustre I/O over 2x1GbE
• Thunder (B451): 1,004 quad Itanium2 compute nodes, 1,024-port QsNet Elan4, 4 login nodes with 6 Gb-Enet, 2 MDS, 16 gateway nodes @ 350 MB/s delivered Lustre I/O over 4x1GbE
• BG/L (TSF): 65,536 dual PowerPC 440 compute nodes, 1,024 PPC440 I/O nodes, BG/L torus and global tree/barrier networks
• PVC (B451): 128-port Elan3, 52 dual P4 render nodes, 6 dual P4 display nodes, 2 login nodes, gateway nodes
• OCF SGS File System Cluster (OFC, B439/B113): groups of dual-P4 OST heads (64, 64, 196, and 208) with 2 Gb FC to FC RAID (146, 73, 36 GB drives)
• Federated Ethernet backbone (MM fiber, SM fiber, and copper 1 GigE) connecting the clusters to the 400-600 terabyte HPSS archive (via PFTP) and the LLNL external backbone
-
THIC Meeting, San Jose CA, 10-Mar-2004 26
SNL’s new MPP architecture machine is ASCI Red Storm
• Scalability
• Reliability
• Usability
• Simplicity
• Cost effectiveness
-
THIC Meeting, San Jose CA, 10-Mar-2004 27
System Layout (27 x 16 x 24 mesh)

[Diagram: cabinet layout showing normally unclassified nodes, switchable nodes, normally classified nodes, and disconnect cabinets]
-
THIC Meeting, San Jose CA, 10-Mar-2004 28
Red Storm Architecture

• True MPP designed to be a single system
• Distributed memory parallel supercomputer
• Fully connected 3D mesh interconnect
• 108 compute node cabinets
• 10,368 processors (AMD Opteron @ 2.0 GHz)
• ~10 TB of DDR memory @ 333 MHz
• Red/Black switching (classified/unclassified)
• 8 service and I/O cabinets on each end (256 processors for each color)
• 240 TB of disk storage (120 TB per color)
-
THIC Meeting, San Jose CA, 10-Mar-2004 29
Purple is next 100 TF platform at LLNL, with 2 PB of disk storage

[Diagram: parallel batch/interactive/visualization nodes and I/O nodes connected by system data and control networks, NFS/login nodes on a login network, and a Fibre Channel 2 I/O network]

Specific Purple details still in contract negotiation
-
THIC Meeting, San Jose CA, 10-Mar-2004 30
Outline
• The NNSA Stockpile Stewardship Program
• Where we are now: ASCI Red, Blue, White, Q
• Challenges for today: Linux, Red Storm, Purple
• The future challenges: BlueGene/L and beyond
• A short digression: Storage and file systems
-
THIC Meeting, San Jose CA, 10-Mar-2004 31
BlueGene/L is being built as part of IBM Purple contract

• Compute chip: 2 processors, 2.8/5.6 GF/s, 4 MiB* eDRAM, ~11 mm
• Compute card / I/O card (FRU, field replaceable unit): 25 mm x 32 mm, 2 nodes (4 CPUs), (2x1x1), 2.8/5.6 GF/s, 256/512 MiB* DDR, 15 W
• Node card: 16 compute cards, 0-2 I/O cards, 32 nodes (64 CPUs), (4x4x2), 90/180 GF/s, 8 GiB* DDR
• Midplane (SU, scalable unit): 16 node boards, 512 nodes (1,024 CPUs), (8x8x8), 1.4/2.9 TF/s, 128 GiB* DDR, 7-10 kW
• Cabinet: 2 midplanes, 1,024 nodes (2,048 CPUs), (8x8x16), 2.9/5.7 TF/s, 256 GiB* DDR, 15-20 kW
• System: 64 cabinets, 65,536 nodes (131,072 CPUs), (32x32x64), 180/360 TF/s, 16 TiB*, 1.2 MW, 2,500 sq ft

(compare this with a 1988 Cray YMP/8 at 2.7 GF/s)
* http://physics.nist.gov/cuu/Units/binary.html
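The system-level figures follow directly from the per-node ones: 65,536 nodes at 2.8/5.6 GF/s give roughly 180/360 TF/s peak, and 65,536 x 256 MiB gives 16 TiB of memory. A small illustrative roll-up of that arithmetic (not from the slides):

```c
/* Roll up BlueGene/L per-node figures to the full 64-cabinet system,
 * using the numbers quoted on the packaging slide. */
#include <stdio.h>

int main(void)
{
    long   nodes           = 65536L;  /* 64 cabinets x 1,024 nodes  */
    double gf_per_node_min = 2.8;     /* GF/s, single-FPU rate      */
    double gf_per_node_max = 5.6;     /* GF/s, dual-FPU peak        */
    double mib_per_node    = 256.0;   /* MiB DDR per node (minimum) */

    printf("peak:   %.0f / %.0f TF/s\n",
           nodes * gf_per_node_min / 1000.0,
           nodes * gf_per_node_max / 1000.0);          /* ~183 / ~367 TF/s */
    printf("memory: %.1f TiB\n",
           nodes * mib_per_node / (1024.0 * 1024.0));  /* 16 TiB */
    return 0;
}
```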
-
THIC Meeting, San Jose CA, 10-Mar-2004 32
BlueGene/L will have 32,768 dual-node compute cards

• Heatsinks designed for 16 W
• 54 mm (2.125") tall
• 206 mm (8.125") wide, 14 layers
• 6 x 180-pin connectors
• 9 x 256/512 Mb DRAM
-
THIC Meeting, San Jose CA, 10-Mar-2004 33
BlueGene/L will contribute at all length and time scales

• Continuum: ALE3D, plasticity of complex shapes
• Mesoscale: NIKE3D, aggregate grain response, poly-crystal plasticity
• Microscale: dislocation dynamics, collective behavior of defects, single-crystal plasticity
• Atomic scale: molecular dynamics, unit mechanisms of defect mobility and interaction

[Chart: length scale from nm (10^-9 m) through µm (10^-6 m) to mm (10^-3 m) vs. time scale from ps through ns and µs to ms and s, linked by the constitutive relation σ = σ(ε, ε̇, T, P, ...)]

First-time overlap calculations will allow direct comparison of detailed and coarse-scale models of plasticity of crystals
-
THIC Meeting, San Jose CA, 10-Mar-2004 34
BlueGene/L has a growing list of industry/academia collaborators

[Diagram of collaboration areas (partners shown as logos): hardware design and build, network design and build, OS and system software; applications, file systems, batch system, kernel evaluation, programming models, debugger and visualization; network simulator, MPI tracing, application scaling, application tracing and performance; parallel objects (CHARM++), STAPL (standard adaptive template library), optimized FFT, MPI (message passing interface), PAPI (performance monitoring), performance analysis (Vampir/GuideView)]
-
THIC Meeting, San Jose CA, 10-Mar-2004 35
If industry progress slows, how to get simulation to next step?

PITAC 1999 Recommendations (President’s IT Advisory Committee)
— $$ for R&D on innovative computing technologies
— $$ for software research
— $$ for petaflops on some applications by 2010
— $$ to fund the most powerful high-end systems
— Can this be leveraged into a broad national program?
-
THIC Meeting, San Jose CA, 10-Mar-2004 36
National Academies Report: The Future of Supercomputing

The Future of Supercomputing: An Interim Report (2003) from the Computer Science and Telecommunications Board
http://www7.nationalacademies.org/cstb/pub_supercomp_int.html
• Sponsored by DOE Office of Science and DOE ASC
• Goals of the study
  — Supercomputing R&D in support of U.S. needs
  — Context and background
  — Applications and implications for design
  — Market, national security, role of U.S. government
  — Options for progress/recommendations (final report, 2004)
-
THIC Meeting, San Jose CA, 10-Mar-2004 37
National Academies Report: The Future of Supercomputing

Some observations from the Interim Report
• U.S. is in pretty good shape regarding manufacturing
• Custom and commodity species have their own niches
• Need balance between customization and commodity
• Need balance between evolution and innovation
• Need continuity and sustained investment
• Government role essential (market incentives insufficient)
• Supercomputer software is not in good shape
  — Hard to program
  — Inadequate development tools
  — Legacy code porting problems
-
THIC Meeting, San Jose CA, 10-Mar-2004 38
Outline
• The NNSA Stockpile Stewardship Program
• Where we are now: ASCI Red, Blue, White, Q
• Challenges for today: Linux, Red Storm, Purple
• The future challenges: BlueGene/L and beyond
• A short digression: Storage and file systems
-
THIC Meeting, San Jose CA, 10-Mar-2004 39
Some interesting things about 70’s, 80’s, early 90’s (G. Grider)

• Many (a dozen or so) supercomputers, all disk-poor
• Common serial file system and archive (integrated)
• Archival file system was less than an order of magnitude slower than supercomputer local disk
• Invented own networks, out of parallel technology
  — Invented our own protocols, much like IP today
  — 3-5 MB/sec in 80’s when fast networks were 56 kb/s
  — HIPPI is an example: 100 MB/sec in 1989, when most networks were 10 Mb/s and fast networks were 50 Mb/s

This slide and the following three slides courtesy of Gary Grider, LANL
-
THIC Meeting, San Jose CA, 10-Mar-2004 40
Enter ASCI in the mid-90’s
• In the mid 90’s the ASCI program attempted to accelerate things through the use of massive parallelism.
• We moved to a new model for balanced system, with new ratios for storage feeds and speeds.
• Developed a forward-looking data transfer and storage technology roadmap to address barriers.
-
THIC Meeting, San Jose CA, 10-Mar-2004 41
Some interesting things about the late 90’s
• Went from a dozen supercomputers to 1 or 2, and from disk-poor supercomputers to disk-rich
• Supercomputer file system had to become parallel and use supercomputer interconnect to move data
• Each supercomputer had to have its own parallel file system (not a common file system)
• Went from integrated common file system with archive to separate parallel archive (HPSS)
• Archive now over an order of magnitude slower than supercomputer parallel file system
-
THIC Meeting, San Jose CA, 10-Mar-2004 42
Some interesting things about the current 21st century trend

Due to lower costs for supercomputers
— Going back to lots of lower cost supercomputers that are disk-poor
— Probably need to move towards a scalable common parallel file system
— Probably need to integrate parallel archive and common parallel file system
— Probably need to have a parallel multi-supercomputer secure scalable backbone

We have not been sitting idle hoping for “magic happens here” solutions
-
THIC Meeting, San Jose CA, 10-Mar-2004 43
HPSS archival storage slide from last time I was here…

Accomplishments
— A 20x performance increase in 15 months (faster nets and disks)
— PSE Milepost demonstrated 170 MB/s aggregate throughput White-to-HPSS
— Large single file transfer rates of up to 80 MB/s White-to-HPSS
— Large single file transfer rates of up to 150 MB/s White-to-SGI

Challenges
— Yearly doubling of throughput is needed for next machine

At 170 MB/s, 2 TB of data moves to storage in less than 4 hours. A year and a half ago it took two and a half days to move the same amount of data.

[Chart: aggregate throughput to storage, FY96-FY01: 1, 4, 6, 9, 120, 170 MB/s; steps labeled “Moved to HPSS,” “Moved to SP Nodes,” “Moved to Jumbo GE & Parallel Striping,” “Moved to Faster Disk on Faster Nodes & multi-node Concurrency”]
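The before/after claim on this slide is easy to verify: 2 TB at 170 MB/s is roughly 3.3 hours, while the earlier ~9 MB/s rate works out to about 2.6 days. A small illustrative check (not from the slides):

```c
/* Check the HPSS throughput claim: time to move 2 TB at the
 * FY01 rate (170 MB/s) vs. the earlier FY99 rate (9 MB/s). */
#include <stdio.h>

int main(void)
{
    double data_mb = 2.0e6;           /* 2 TB expressed in MB */
    double rates[] = { 170.0, 9.0 };  /* MB/s                 */

    for (int i = 0; i < 2; i++) {
        double hours = data_mb / rates[i] / 3600.0;
        printf("%6.1f MB/s -> %5.1f hours (%.1f days)\n",
               rates[i], hours, hours / 24.0);
    }
    /* Output: ~3.3 hours at 170 MB/s vs. ~2.6 days at 9 MB/s,
     * matching the "less than 4 hours" vs. "two and a half days" claim. */
    return 0;
}
```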
-
THIC Meeting, San Jose CA, 10-Mar-2004 44
LLNL’s yearly “I/O Blueprints” have helped to increase rates

A 115x performance improvement in four years!

[Chart: aggregate throughput to storage, FY96-FY03: 1, 4, 6, 9, 120, 170, 854, and 1,037 MB/s (the last labeled “12/03 Throughput”); steps labeled “Moved to HPSS,” “Moved to SP Nodes,” “Moved to Jumbo GE, Parallel Striping, Faster Disk & Nodes using multiple pftp sessions,” “Moved to Faster Disk using multiple Htar sessions on multiple nodes”]
-
THIC Meeting, San Jose CA, 10-Mar-2004 45
Current HPSS data movement with HSM disk/tape hierarchy

[Diagram: an application on the MCR or ALC platform writes to platform disk; a PFTP client and client mover on the platform transfer data through HPSS movers to HPSS disk and the tape hierarchy (steps 1-7)]

Key elements
1. User in the loop to force keep/delete decision
2. Large HPSS disk cache and multiple copies on disk and tape
3. Bandwidth limited by PFTP bandwidth off compute platform
-
THIC Meeting, San Jose CA, 10-Mar-2004 46
Lustre Object Storage Target HPSS Data Movement Vision

[Diagram: an application on the MCR or ALC platform writes to globally accessible Lustre OST disk; a location-independent client archive agent and tape front-end access the data with open(), seek(), read() and move it directly to tape (steps 1-3)]

Key elements
1. User in the loop to force keep/delete decision
2. Direct-to-tape transfers eliminate HPSS disk cache
3. Bandwidth limited by SERIAL file system bandwidth access from HPSS
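In this vision the archive agent is simply another client of the globally mounted Lustre file system: it reads the file to be archived with ordinary open()/seek()/read() calls and streams it toward the tape front-end. A minimal, hypothetical sketch of that read path follows; the mount point, chunk size, and send_to_tape() stub are illustrative assumptions, not details from the slides.

```c
/* Hypothetical archive-agent read loop: read a file straight off a
 * globally mounted Lustre file system with POSIX calls, as the
 * "direct to tape" data movement vision describes. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

static void send_to_tape(const char *buf, ssize_t len)
{
    /* Stub: a real agent would hand the buffer to the tape front-end. */
    (void)buf; (void)len;
}

int main(void)
{
    const char  *path  = "/p/lustre/archive_me.dat";  /* illustrative path */
    const size_t chunk = 8 * 1024 * 1024;             /* 8 MiB reads       */

    int fd = open(path, O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    char *buf = malloc(chunk);
    ssize_t n;
    while ((n = read(fd, buf, chunk)) > 0)
        send_to_tape(buf, n);          /* stream each chunk toward tape */

    free(buf);
    close(fd);
    return 0;
}
```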
-
THIC Meeting, San Jose CA, 10-Mar-2004 47
Proposed HPSS parallel local file movers for tape parallelism

[Diagram: an application on an MCR or ALC capacity platform writes to Lustre disk; a location-independent PFTP client coordinates multiple location-independent HPSS parallel local file movers (PLFMs), which access the data with open(), seek(), read() for parallel movement to tape (steps 1-3)]

Key elements
1. User in the loop to force keep/delete decision
2. Direct-to-tape transfers eliminate HPSS disk cache
3. Improved by PARALLEL file system bandwidth access from HPSS
-
THIC Meeting, San Jose CA, 10-Mar-2004 48
Tri-lab historical timeline for scalable, parallel file systems

Milestones from 1999 through 2004 (timeline figure, approximate order):
• Propose initial architecture
• SGSFS workshop: “You’re Crazy”
• Build initial requirements document
• Proposed PathForward activity for SGSFS
• Tri-Lab joint requirements document complete
• PathForward team formed to pursue an RFI/RFQ approach; RFI issued; recommend RFQ process
• RFQ, analysis, recommend funding open source OBSD development and NFSv4 efforts
• Begin partnering talks/negotiations for OBSD and NFSv4 PathForwards
• PathForward proposal with OBSD vendor; Panasas born
• Lustre PathForward effort is born
• Alliance contracts placed with universities on OBSD, overlapped I/O and NFSv4
• U Minn Object Archive begins
• HECRTF workshop: Re-invent POSIX I/O? “Are We Still Crazy?”
-
THIC Meeting, San Jose CA, 10-Mar-2004 49
From the June 2003 HECRTF workshop report (available)

• For info: http://www.nitrd.gov/hecrtf-outreach/index.html
• NNSA Tri-labs (Lee Ward of SNL, Tyce McClarty of LLNL, and Gary Grider of LANL) were lone I/O representatives at this workshop
• Overwhelming consensus that POSIX I/O is inadequate

5.5. Data Management and File Systems
“We believe legacy, POSIX I/O interfaces are incompatible with the full range of hardware architecture choices contemplated … The interface does not fully support the needs for parallel support along the I/O path … An alternative, appropriate operating system API should be developed for high-end computing systems …”
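One concrete illustration of what serial POSIX I/O lacks is coordinated, collective access to a single shared file. MPI-IO (part of MPI-2, already in use at the time) provides exactly that; the sketch below is illustrative and not drawn from the workshop report, showing every rank writing its own block of one shared file in a single collective call.

```c
/* Illustrative MPI-IO example: all ranks write disjoint blocks of one
 * shared file collectively, the kind of parallel I/O path that plain
 * POSIX open()/write() does not coordinate across nodes. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int count = 1024;             /* doubles per rank         */
    double buf[1024];
    for (int i = 0; i < count; i++)
        buf[i] = rank + i * 1.0e-3;     /* stand-in simulation data */

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "shared_output.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* Each rank writes at its own offset; the call is collective, so the
     * MPI library can aggregate and schedule the I/O across ranks. */
    MPI_Offset offset = (MPI_Offset)rank * count * sizeof(double);
    MPI_File_write_at_all(fh, offset, buf, count, MPI_DOUBLE, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}
```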
-
THIC Meeting, San Jose CA, 10-Mar-2004 50
DISCLAIMER
This document was prepared as an account of work sponsored by an agency of the United States Government. Neither the United States Government nor the University of California nor any of their employees, makes any warranty, express or implied, or assumes any legal liability or responsibility for the accuracy, completeness, or usefulness of any information, apparatus, product, or process disclosed, or represents that its use would not infringe privately owned rights. Reference herein to any specific commercial products, process, or service by trade name, trademark, manufacturer, or otherwise, does not necessarily constitute or imply its endorsement, recommendation, or favoring by the United States Government or the University of California. The views and opinions of authors expressed herein do not necessarily state or reflect those of the United States Government or the University of California, and shall not be used for advertising or product endorsement purposes.
This work was performed under the auspices of the U.S. Department of Energy by University of California Lawrence Livermore National Laboratory under contract No. W-7405-Eng-48.