Chip-Multiprocessors & You
John Dennis ([email protected])
Software Engineering Working Group Meeting, March 16, 2007
Intel “Tera Chip”
- 80-core chip, 1 Teraflop
- 3.16 GHz / 0.95 V / 62 W
- Process: 45 nm technology, high-k
- 2D mesh network; each processor has a 5-port router
- Connects to “3D memory”
Outline
- Chip-Multiprocessor
- Parallel I/O library (PIO)
- Fun with Large Processor Counts
  - POP
  - CICE
Moore’s Law
- Most things are twice as nice every 18 months: transistor count, processor speed, DRAM density
- Historical result: solve a problem twice as large in the same time, or solve the same-size problem in half the time
- --> Inactivity leads to progress!
The advent of Chip-multiprocessors
Moore’s Law gone bad!
New implications of Moore’s Law
- Every 18 months: the number of cores per socket doubles, memory density doubles, and the clock rate may increase slightly
- 18 months from now: 8 cores per socket, a slight increase in clock rate (~15%), and the same memory per core!! (see the arithmetic sketch below)
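To make the per-core arithmetic concrete, here is a minimal sketch (the starting socket numbers are mine and purely illustrative; the doubling and ~15% figures are the slide's):

```python
# Project per-core resources over 18-month generations, assuming (per the
# slides) cores and memory both double while the clock rises ~15%.
cores, mem_gb, clock_ghz = 4, 8.0, 2.5  # hypothetical starting socket

for gen in range(1, 4):
    cores *= 2
    mem_gb *= 2
    clock_ghz *= 1.15
    print(f"gen {gen}: {cores} cores, {mem_gb / cores:.1f} GB/core, "
          f"{clock_ghz:.2f} GHz")
# Memory per core stays flat at 2.0 GB/core in every generation.
```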
New implications of Moore’s Law (cont’d)
- Inactivity leads to no progress! Possible outcomes:
  - Same problem size / same parallelism: solve the problem ~15% faster
  - Bigger problem size: scalable memory? More processors enable a ~2x reduction in time to solution
  - Non-scalable memory? May limit the number of processors that can be used; waste half the cores on a socket just to use the memory?
- All components of an application must scale to benefit from Moore’s Law increases!
- The memory footprint problem will not solve itself!
Questions?
Parallel I/O library (PIO)
John Dennis ([email protected]), Ray Loy ([email protected])
Introduction
- All component models need parallel I/O; serial I/O is bad!
  - Increased memory requirement
  - Typically a negative impact on performance
- Primary developers: J. Dennis, R. Loy
- Necessary for POP BGW runs
Design goals
- Provide parallel I/O for all component models
- Encapsulate complexity into the library
- Simple interface for component developers to implement
Design goals (cont’d)
- Extensible for future I/O technology
- Backward compatible (node=0)
- Support for multiple formats: {sequential, direct} binary, netCDF
- Preserve the format of input/output files
- Supports 1D, 2D and 3D arrays: currently XY; extensible to XZ or YZ
Terms and Concepts
- PnetCDF [ANL]: high-performance I/O; different interface; stable
- netCDF4 + HDF5 [NCSA]: same interface; needs the HDF5 library; less stable; lower performance; no support on Blue Gene
Terms and Concepts (cont’d)
- Processor stride: allows matching a subset of MPI I/O nodes to the system hardware (see the sketch below)
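As a rough illustration of the stride idea (the helper below is hypothetical, not PIO's API), every stride-th MPI rank becomes an I/O task:

```python
def io_ranks(num_tasks: int, stride: int, base: int = 0) -> list[int]:
    """Hypothetical helper: designate every `stride`-th MPI rank,
    starting at `base`, as an I/O task."""
    return list(range(base, num_tasks, stride))

# e.g. 32 MPI tasks with a stride of 8 -> 4 I/O tasks
print(io_ranks(32, 8))  # [0, 8, 16, 24]
```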
Terms and Concepts (cont’d)
- I/O decomposition vs. compute (COMP) decomposition
  - IO decomp == COMP decomp: MPI-IO + message aggregation
  - IO decomp != COMP decomp: need a rearranger (MCT)
- No component-specific info in the library; pair it with existing communication technology
- 1D arrays in the library; the component must flatten 2D and 3D arrays (see the sketch below)
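A minimal NumPy sketch of the flatten/restore round trip (sizes and names are illustrative only, not PIO's interface):

```python
import numpy as np

# A component's local 3D field (k levels of a ny x nx block);
# the sizes here are illustrative.
field3d = np.arange(2 * 3 * 4, dtype=np.float64).reshape(2, 3, 4)

# The library works on 1D arrays, so the component flattens before the
# write call and reshapes after a read.
flat = field3d.ravel()            # shape (24,)
restored = flat.reshape(2, 3, 4)  # the round trip recovers the block
assert (restored == field3d).all()
```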
Component Model ‘issues’
- POP & CICE: missing blocks
  - Update of neighbors’ halos; who writes the missing blocks?
  - Asymmetry between read and write
  - ‘sub-block’ decompositions are not rectangular
- CLM: decomposition not rectangular; who writes the missing data?
What works
- Binary I/O [direct]
  - Tested on POWER5, BGL
  - Rearrange with MCT + MPI-IO
  - MPI-IO with no rearrangement
- netCDF
  - Rearrange with MCT [new]: reduced memory
- PnetCDF
  - Rearrange with MCT, or no rearrangement
  - Tested on POWER5, BGL
What works (cont’d)
- Prototype added to POP2
  - Reads restart and forcing files correctly
  - Writes binary restart files correctly
  - Necessary for BGW runs
- Prototype implementation in HOMME [J. Edwards]: writes netCDF history files correctly
- POPIO benchmark: 2D array [3600x2400] (70 Mbyte); test code for correctness and performance; tested on 30K BGL processors in Oct ’06
- Performance: POWER5, 2-3x the serial I/O approach; BGL, mixed
Complexity / Remaining Issues
- Multiple ways to express a decomposition (compared in the sketch below):
  - GDOF: global degree of freedom --> (MCT, MPI-IO)
  - Subarrays: start + count (PnetCDF)
- Limited expressiveness: will not support the ‘sub-block’ decompositions in POP & CICE, or CLM
- Need a common language for the interface between the component model and the library
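The two styles can describe the same block; here is a minimal sketch of the correspondence (the indexing convention is illustrative, not either library's actual API):

```python
# Describe one rectangular block of an 8x6 global grid in both languages.
nx_global = 8
start, count = (2, 1), (4, 3)  # subarray style: (x, y) offset + extent

# GDOF style: an explicit 1-based global index for every block element.
gdof = [(start[1] + j) * nx_global + (start[0] + i) + 1
        for j in range(count[1]) for i in range(count[0])]

assert len(gdof) == count[0] * count[1]
print(gdof)  # [11, 12, 13, 14, 19, 20, 21, 22, 27, 28, 29, 30]
```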
Conclusions
- Working prototypes: POP2 for binary I/O, HOMME for netCDF
- PIO telecon: discuss progress every 2 weeks
- Work in progress: multiple efforts underway; accepting help
- http://swiki.ucar.edu/ccsm/93; in the CCSM subversion repository
Motivation
- Can the Community Climate System Model (CCSM) be a petascale application?
  - Use 10-100K processors per simulation
- Increasingly common access to large systems
  - ORNL Cray XT3/4: 20K [2-3 weeks]
  - ANL Blue Gene/P: 160K [Jan 2008]
  - TACC Sun: 55K [Jan 2008]
- Petascale for the masses?
  - Lag time in the Top 500 list is 4-5 years
  - @ NCAR before 2015
Outline
- Chip-Multiprocessor
- Parallel I/O library (PIO)
- Fun with Large Processor Counts
  - POP
  - CICE
Status of POP
- Access to 17K Cray XT4 processors: 12.5 years/day [current record]; 70% of time in the solver
- Won a BGW cycle allocation: “Eddy Stirring: The Missing Ingredient in Nailing Down Ocean Tracer Transport” [J. Dennis, F. Bryan, B. Fox-Kemper, M. Maltrud, J. McClean, S. Peacock]
  - 110 rack-days / 5.4M CPU hours
  - 20-year 0.1° POP simulation
  - Includes a suite of dye-like tracers to simulate the eddy diffusivity tensor
Status of POP (cont’d)
- The allocation will occur over ~7 days
- Run in production on 30K processors
- Needs parallel I/O to write the history file
- Start runs in 4-6 weeks
Outline
- Chip-Multiprocessor
- Parallel I/O library (PIO)
- Fun with Large Processor Counts
  - POP
  - CICE
Status of CICE
- Tested CICE @ 1/10°
  - 10K Cray XT4 processors
  - 40K IBM Blue Gene processors [BGW days]
- Use weighted space-filling curves (wSFC), with erfc- and climatology-based weights
[Figure: POP (gx1v3) + space-filling curve]

[Figure: Space-filling curve partition for 8 processors]
Weighted Space-filling curves
Estimate the work for each grid block (see the sketch below):

  Work_i = w0 + P_i * w1

where:
  w0: fixed work for all blocks
  w1: additional work if the block contains sea ice
  P_i: probability that block i contains sea ice

For our experiments: w0 = 2, w1 = 10
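A minimal sketch of this weight computation (function and variable names are mine, not the CICE code's):

```python
def block_work(p_ice: float, w0: float = 2.0, w1: float = 10.0) -> float:
    """Estimated work for one grid block: Work_i = w0 + P_i * w1,
    using the slide's experimental values w0 = 2, w1 = 10."""
    return w0 + p_ice * w1

print(block_work(0.0))  # ice-free block: 2.0
print(block_work(1.0))  # certainly-iced block: 12.0
```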
Probability Function
Error function:

  P_i = erfc((θ - max(|lat_i|)) / σ)

where:
  lat_i: maximum latitude in block i
  θ: mean sea-ice extent
  σ: variance in sea-ice extent

θ_NH = 70°, θ_SH = 60°, σ = 5°
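A sketch of the probability evaluation (my reconstruction; the extracted slide lost its Greek symbols, so θ and σ here are inferred from the definitions above). Note that erfc ranges over [0, 2], so values above 1 simply mark blocks as certainly iced:

```python
import math

def p_sea_ice(max_abs_lat: float,
              theta: float = 70.0, sigma: float = 5.0) -> float:
    """P_i = erfc((theta - max|lat_i|) / sigma), per the slide.
    theta: mean sea-ice extent (70 deg NH, 60 deg SH); sigma: 5 deg."""
    return math.erfc((theta - max_abs_lat) / sigma)

for lat in (45.0, 65.0, 70.0, 80.0):
    print(lat, round(p_sea_ice(lat), 3))
# 45 -> ~0.0, 65 -> ~0.157, 70 -> 1.0, 80 -> ~1.995
```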
1° CICE4 on 20 processors
- Small domains @ high latitudes
- Large domains @ low latitudes
0.1° CICE4
- Developed at LANL; finite difference; models sea ice
- Shares grid and infrastructure with POP; reuse techniques from the POP work
- Computational grid: [3600 x 2400 x 20]
- Computational load imbalance creates problems: ~15% of the grid has sea ice
- Use weighted space-filling curves?
- Evaluate using a benchmark: 1 day / initial run / 30-minute timestep / no forcing
[Figure: CICE4 @ 0.1°]
Timings for 1°, npes=160, θ_NH=70°
- Load imbalance: Hudson Bay is south of 70°
Timings for 1°, npes=160, θ_NH=55°
Better Probability Function
Climatological function:

  P_i = 1.0 if (Σ_j φ_ij) / n_i ≥ 0.1, and 0.0 otherwise

where:
  φ_ij: climatological maximum sea-ice extent [satellite observation]
  n_i: the number of points within block i with non-zero φ_ij
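A sketch of this climatological test (again my reconstruction of the garbled formula; the 0.1 threshold is the slide's):

```python
def p_climatology(phi_block: list[float], threshold: float = 0.1) -> float:
    """P_i = 1.0 if the mean of phi over the block's non-zero points
    reaches the threshold, else 0.0. phi values are the climatological
    maximum sea-ice extent at each grid point of block i."""
    nonzero = [phi for phi in phi_block if phi != 0.0]
    if not nonzero:
        return 0.0  # no point in the block has ever seen ice
    return 1.0 if sum(nonzero) / len(nonzero) >= threshold else 0.0

print(p_climatology([0.0, 0.0, 0.05]))      # 0.0: trace ice only
print(p_climatology([0.0, 0.4, 0.6, 0.0]))  # 1.0: climatological ice
```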
Timings for 1°, npes=160, climate-based
- Reduces dynamics sub-cycling time by 28%!
Acknowledgements / Questions?
Thanks to: D. Bailey (NCAR), F. Bryan (NCAR), T. Craig (NCAR), J. Edwards (IBM), E. Hunke (LANL), B. Kadlec (CU), E. Jessup (CU), P. Jones (LANL), K. Lindsay (NCAR), W. Lipscomb (LANL), M. Taylor (SNL), H. Tufo (NCAR), M. Vertenstein (NCAR), S. Weese (NCAR), P. Worley (ORNL)
Computer time:
- Blue Gene/L time: NSF MRI Grant, NCAR, University of Colorado, IBM (SUR) program, BGW Consortium Days, IBM Research (Watson)
- Cray XT3/4 time: ORNL, Sandia
Partitioning with Space-filling Curves
- Map 2D -> 1D
- Curves come in a variety of sizes N_b:
  - Hilbert (N_b = 2^n)
  - Peano (N_b = 3^m)
  - Cinco (N_b = 5^p)
  - Hilbert-Peano (N_b = 2^n 3^m)
  - Hilbert-Peano-Cinco (N_b = 2^n 3^m 5^p)
- Partitioning reduces to splitting a 1D array (see the sketch below)
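Once blocks are ordered along the curve, partitioning is a weighted split of a 1D array; here is a minimal greedy sketch (illustrative only, not the production partitioner):

```python
def partition_1d(weights: list[float], nparts: int) -> list[int]:
    """Greedily split curve-ordered block weights into nparts contiguous
    chunks, cutting whenever the running sum reaches the ideal share."""
    target = sum(weights) / nparts
    owner, part, acc = [], 0, 0.0
    for w in weights:
        if acc >= target and part < nparts - 1:
            part, acc = part + 1, 0.0
        owner.append(part)
        acc += w
    return owner

# 8 curve-ordered blocks, heavier (12.0) where sea ice is likely
weights = [2.0, 2.0, 12.0, 12.0, 12.0, 2.0, 2.0, 2.0]
print(partition_1d(weights, 4))  # [0, 0, 0, 1, 2, 3, 3, 3]
```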
Scalable data structures
A common problem among applications:
- WRF: serial I/O [fixed]; duplication of lateral boundary values
- POP & CICE: serial I/O
- CLM: serial I/O; duplication of grid info
Scalable data structures (cont’d)
- CAM: serial I/O; lookup tables
- CPL: serial I/O; duplication of grid info
- The memory footprint problem will not solve itself!
[Figure: Remove Land blocks]
Case Study: Memory use in CLM
- CLM configuration: 1x1.25 grid, no RTM, MAXPATCH_PFT = 4, no CN, no DGVM
- Measure stack and heap on 32-512 BG/L processors
[Figure: Memory use of CLM on BGL]
Motivation (cont’d)
Multiple efforts are underway:
- CAM scalability + high-resolution coupled simulation [A. Mirin]
- Sequential coupler [M. Vertenstein, R. Jacob]
- Single-executable coupler [J. Wolfe]
- CCSM on Blue Gene [J. Wolfe, R. Loy, R. Jacob]
- HOMME in CAM [J. Edwards]
Outline
- Chip-Multiprocessor
- Fun with Large Processor Counts
  - POP
  - CICE
  - CLM
- Parallel I/O library (PIO)
Status of CLM
- Work of T. Craig:
  - Elimination of global memory
  - Reworking of the decomposition algorithms
  - Addition of PIO
- Short-term goal: participation in BGW days, June ’07; investigating scalability at 1/10°
Status of CLM memory usage
- May 1, 2006: memory usage increases with processor count; can run 1x1.25 on 32-512 processors of BGL
- July 10, 2006: memory usage scales to an asymptote; can run 1x1.25 on 32-2K processors of BGL; ~350 persistent global arrays [24 Gbytes/proc @ 1/10 degree]
- January 2007: ~150 persistent global arrays [10.5 Gbytes/proc @ 1/10 degree]; 1/2-degree runs on 32-2K BGL processors
- February 2007: 18 persistent global arrays [1.2 Gbytes/proc @ 1/10 degree]
- Target: no persistent global arrays; 1/10-degree runs on a single rack of BGL
Proposed Petascale Experiment
- Ensemble of 10 runs / 200 years
- Petascale configuration:
  - CAM (30 km, L66)
  - POP @ 0.1°: 12.5 years / wall-clock day [17K Cray XT4 processors]
  - Sea ice @ 0.1°: 42 years / wall-clock day [10K Cray XT3 processors]
  - Land model @ 0.1°
- Sequential design (105 days per run): 32K BGL / 10K XT3 processors
- Concurrent design (33 days per run): 120K BGL / 42K XT3 processors
[Figure: POPIO benchmark on BGW]
CICE results (cont’d)
- Correct weighting increases the simulation rate
- wSFC works best at high resolution
- Variable-sized domains:
  - Large domains at low latitude -> higher boundary exchange cost
  - Small domains at high latitude -> lower floating-point cost
- Optimal balance of computational and communication cost? Work in progress!