
Copyright HiPERiSM Consulting, LLC, 2013 http://www.hiperism.com

George Delic, Ph.D.

HiPERiSM Consulting, LLC

(919)484-9803

P.O. Box 569, Chapel Hill, NC 27514

[email protected]

http://www.hiperism.com

HiPERiSM Consulting, LLC.

George Delic

HiPERiSM Consulting, LLC

UPDATE ON A NEW PARALLEL SPARSE CHEMISTRY SOLVER FOR CMAQ

12th Annual CMAS Conference, Chapel Hill, NC

30 October, 2013

Overview

• CMAQ from HiPERiSM and the U.S. EPA
• Hardware platforms
• Software and compilers
• Episode studied
• Thread parallel performance metrics
• 2 compilers, 2 platforms (24hr run)
• Chemistry solver parallel efficiency (1 hr run)
• Accuracy metrics for sparse solution of Ax=y
• CMAQ numerical performance
• Numerical error in U.S. EPA code
• Concentrations for O3, NO2 at hour 23
• Lessons learned
• Conclusions
• Next steps for CMAQ development

Hardware platforms

• Intel: 2 x 4-core CPUs = 8 cores, W5590 Nehalem™, 3.3 GHz
• AMD: 4 x 12-core CPUs = 48 cores, 6176SE Opteron™, 2.3 GHz

Software and compilers

• OS
    • Linux 64-bit
• CMAQ versions (Rosenbrock solver*)
    • U.S. EPA uses JSPARSE (serial)
    • HiPERiSM uses FSPARSE (parallel)
• Compilers (legend)
    • Intel 12.1 (ifort/Intel)
    • Portland 13.4 (pgf90)

(*) The Rosenbrock solver requires a sparse linear solver for the linear system Ax=y in the chemistry solution: FSPARSE replaces JSPARSE

Episode studied

• Grid used: 279 x 240 Eastern US domain at 12 km grid spacing and 34 vertical layers
• CMAQ 4.7.1 24-hour episode: August 09, 2006, using the CB05 mechanism with Chlorine extensions and the Aero 4 version for PM modeling
• Total output file size: ~37.7 GB (137 variables)
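The grid dimensions above fix the cell counts that appear on later slides (66,960 Layer 1 values and 47,430 blocks). The following is a small arithmetic sketch of that link. The block size of 48 cells is an assumption: the slides do not state it, and 48 is simply the value that divides the domain into exactly 47,430 blocks.

```python
# Grid dimensions quoted on the slide.
COLS, ROWS, LAYERS = 279, 240, 34

cells_per_layer = COLS * ROWS            # horizontal cells in one layer
total_cells = cells_per_layer * LAYERS   # cells in the full 3-D domain

# ASSUMPTION: 48 cells per chemistry block. This is not stated in the slides;
# it is inferred because it yields the quoted 47,430 blocks with no remainder.
ASSUMED_BLOCK_SIZE = 48
blocks, leftover = divmod(total_cells, ASSUMED_BLOCK_SIZE)

print("cells per layer:", cells_per_layer)
print("total cells:    ", total_cells)
print("blocks:         ", blocks, "leftover cells:", leftover)
```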

Thread parallel performance metrics

• SPEEDUP: U.S. EPA time / thread parallel time
• PARALLEL SCALING: $S_P = T_1 / T_P$
• PARALLEL EFFICIENCY: $E_P = S_P / P$

$T_1$ is the runtime for a single thread

$T_P$ is the runtime for $P$ threads
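As an illustration, the three metrics can be written as short functions. This is a minimal sketch; the function names and the timings are hypothetical and are not measurements from this study.

```python
def speedup_vs_epa(t_epa, t_parallel):
    """SPEEDUP: U.S. EPA runtime divided by thread parallel runtime."""
    return t_epa / t_parallel


def parallel_scaling(t_1, t_p):
    """PARALLEL SCALING: S_P = T_1 / T_P."""
    return t_1 / t_p


def parallel_efficiency(t_1, t_p, p):
    """PARALLEL EFFICIENCY: E_P = S_P / P."""
    return parallel_scaling(t_1, t_p) / p


# Hypothetical wall clock times in hours (illustration only, not measured data).
t_epa, t_1, t_4 = 30.0, 28.0, 8.0

print("speedup vs EPA:", speedup_vs_epa(t_epa, t_4))
print("scaling S_4:   ", parallel_scaling(t_1, t_4))
print("efficiency E_4:", parallel_efficiency(t_1, t_4, 4))
```

Note that SPEEDUP is measured against the EPA release, while scaling and efficiency are measured against the single-thread run of the parallel version, so the two baselines differ.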

2 compilers, 2 platforms (24hr run)

[Figure, left panel: CMAQ wall clock time (hours) for EPA and parallel versions with 1 to 8 threads on Intel and AMD platforms. x-axis: EPA, OMP1, OMP2, OMP4, OMP6, OMP8; y-axis: wall clock time (hours), 16 to 68. Series: ifort on AMD node, pgf90 on AMD node, ifort on Intel node, pgf90 on Intel node.]

[Figure, right panel: Parallel CMAQ speedup versus EPA for 1 to 8 threads on Intel and AMD platforms. x-axis: OMP1, OMP2, OMP4, OMP6, OMP8; y-axis: speedup versus EPA, 0.65 to 1.50. Series: ifort on AMD node, pgf90 on AMD node, ifort on Intel node, pgf90 on Intel node.]

Chemistry solver parallel efficiency (pgf90 on Intel node, 1hr run)

Parallel efficiency > 87% with 2-6 threads.

[Figure: Parallel efficiency by thread count (2-8). x-axis: simulation time (minute), 0 to 60; y-axis: iteration parallel efficiency, 0.55 to 1.00. Series: OMP2, OMP4, OMP6, OMP8.]

CMAQ 4.6.1 MPI Efficiency & (estimated OpenMP speedup)

MPI processes   Hours   Speed-up (estimated OpenMP factor)   MPI efficiency
2               15.1    1.9                                  96%
4               8.2     3.5 (x 1.3)                          88%
8               5.1     5.7 (x 1.4)                          71%
16              3.3     8.7 (x 1.5)                          54%

Portland compiler on x86_64 cluster
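The efficiency column can be cross-checked from the speed-up column using the definition efficiency = speed-up / process count. The sketch below assumes that the first column of the table counts MPI processes, which the slide does not state explicitly; small differences from the quoted percentages are expected because the quoted speed-ups are rounded.

```python
# (MPI processes, hours, speed-up, quoted MPI efficiency in percent),
# copied from the table; the reading of column 1 is an assumption.
TABLE = [
    (2, 15.1, 1.9, 96),
    (4, 8.2, 3.5, 88),
    (8, 5.1, 5.7, 71),
    (16, 3.3, 8.7, 54),
]


def efficiency_percent(speedup, processes):
    """Parallel efficiency in percent: 100 * speed-up / process count."""
    return 100.0 * speedup / processes


recomputed = [efficiency_percent(s, p) for p, _hours, s, _quoted in TABLE]

for (p, _hours, s, quoted), value in zip(TABLE, recomputed):
    print(f"P={p:2d}  speed-up={s}  recomputed={value:.1f}%  quoted={quoted}%")
```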

Accuracy metrics for sparse solution of Ax = y

Value      Norm (1)          Statistic (2)
Residual   norm(Ax-y,inf)    mean or sample
Solution   norm(x,inf)       mean or sample

(1) Used the “inf” norm, or maximum absolute value, over the vector Ax-y of length equal to the number of chemistry species.

(2) Mean over cells in each block, or sampled at one cell in each of 47,430 blocks over the grid domain.
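A minimal sketch of how these two metrics and the two statistics might be computed for one block of cells. The dictionary-of-rows sparse storage, the 3-species matrix, and the per-cell solutions are all hypothetical; this is not the storage scheme of JSPARSE or FSPARSE, and in practice each cell has its own matrix and right-hand side.

```python
def matvec(rows, x):
    """Multiply a square sparse matrix, stored as {row: {col: value}}, by x."""
    return [sum(v * x[j] for j, v in rows.get(i, {}).items())
            for i in range(len(x))]


def inf_norm(v):
    """Largest absolute entry of a vector."""
    return max(abs(e) for e in v)


def residual_norm(rows, x, y):
    """norm(Ax-y,inf) for one grid cell."""
    return inf_norm([a - b for a, b in zip(matvec(rows, x), y)])


# Hypothetical 3-species system, shared by all cells to keep the sketch short.
A = {0: {0: 2.0, 2: 1.0}, 1: {1: 4.0}, 2: {0: 1.0, 2: 3.0}}
y = [4.0, 2.0, 7.0]

# One hypothetical block of three cells, each with its own computed solution x.
block_solutions = [[1.0, 0.5, 2.0], [1.0, 0.5, 2.001], [1.002, 0.5, 2.0]]

residuals = [residual_norm(A, x, y) for x in block_solutions]
solutions = [inf_norm(x) for x in block_solutions]

mean_residual = sum(residuals) / len(residuals)   # "mean" statistic
sampled_residual = residuals[-1]                  # "sample" statistic (last cell)

print("mean residual:   ", mean_residual)
print("sampled residual:", sampled_residual)
```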

CMAQ numerical performance

At the end of the first simulation hour this shows the norm of the residual Ax-y at the last call to the CMAQ chemistry solver, sampled in cell 48 for each of 47,430 blocks.

[Figure, two panels: norm(Ax-y,inf) in JSPARSE ( ■ ) and FSPARSE ( ■ ) methods. x-axis: block number, 0 to 20000 in the first panel and 23716 to 43716 in the second; y-axis: residual, log scale from 1.E-24 to 1.E-03. Legend: HC, EPA.]

Numerical error in U.S. EPA code

• Uses mixed mode arithmetic (DP & SP)
• Inconsistent promotion of SP to DP for constants and variables
• Worst case in CALCKS for thermal and photolytic reaction rates computed in SP
• Inherited SP values amplify precision loss in three Rosenbrock solve stages
• Use of ATOL = 1E-07 is moot
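The effect of mixed mode arithmetic can be illustrated outside CMAQ. The sketch below emulates single precision (SP) by rounding Python's double precision (DP) floats through a 4-byte representation. The constant 0.1 and the accumulation loop are hypothetical stand-ins, not the CALCKS rate calculation; the point is only that SP roundoff can grow to a relative size well above 1E-07.

```python
import struct


def to_single(x):
    """Round a double precision Python float to the nearest single precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


# Hypothetical rate constant; not a value taken from CMAQ.
k_dp = 0.1
k_sp = to_single(k_dp)   # stored in SP, then promoted back to DP
promotion_error = abs(k_sp - k_dp) / k_dp

# Accumulate the constant 1000 times, rounding to SP after every addition
# (emulated SP arithmetic), versus keeping everything in DP.
acc_sp, acc_dp = 0.0, 0.0
for _ in range(1000):
    acc_sp = to_single(acc_sp + k_sp)
    acc_dp = acc_dp + k_dp

exact = 100.0
err_sp = abs(acc_sp - exact) / exact
err_dp = abs(acc_dp - exact) / exact

print(f"relative error after SP storage: {promotion_error:.2e}")
print(f"relative error, SP accumulation: {err_sp:.2e}")
print(f"relative error, DP accumulation: {err_dp:.2e}")
```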

Concentrations for O3 at hour 23

Histogram of all 66,960 concentration values of Layer 1 in decade bins: difference in predictions ( ■ ) and concentration value ( ■ )

[Figure: x-axis: value at bin upper boundary, 1.E-07 to 1.E-01; y-axis: O3 frequency (percent of all counts), log scale from 0.001 to 100.000. Series: Concentration value; Difference = EPA - OMP1.]
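A decade-bin histogram of this kind can be sketched as follows. The four concentration values are hypothetical, and two choices are assumptions rather than details taken from the slides: differences are binned by absolute value, and exact zeros are left out of the counts.

```python
import math
from collections import Counter


def decade_histogram(values):
    """Percent of values per decade bin, keyed by the exponent e of the
    bin's upper boundary 10**e. ASSUMPTION: magnitudes are binned and
    exact zeros are skipped."""
    counts = Counter()
    for v in values:
        if v == 0:
            continue
        counts[math.ceil(math.log10(abs(v)))] += 1
    total = sum(counts.values())
    return {e: 100.0 * n / total for e, n in sorted(counts.items())}


# Hypothetical Layer 1 concentrations from the two code versions.
epa = [0.052, 0.048, 0.0031, 0.061]
omp1 = [0.0517, 0.04803, 0.0031005, 0.0612]

concentration_bins = decade_histogram(epa)
difference_bins = decade_histogram([a - b for a, b in zip(epa, omp1)])

print("concentration:", concentration_bins)
print("difference:   ", difference_bins)
```

Good agreement between the two versions shows up as the difference histogram sitting several decades to the left of the concentration histogram.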

Concentrations for NO2 at hour 23

Histogram of all 66,960 concentration values of Layer 1 in decade bins: difference in predictions ( ■ ) and concentration value ( ■ )

[Figure: x-axis: value at bin upper boundary, 1.E-07 to 1.E-01; y-axis: NO2 frequency (percent of all counts), log scale from 0.01 to 100.00. Series: Concentration value; Difference = EPA - OMP1.]

Lessons learned

Numerical precision

1. Limitations due to EPA’s inconsistent use of mixed mode arithmetic
2. FSPARSE method is more precise by many orders of magnitude
3. FSPARSE method allows relaxation of chemistry time step convergence parameter ATOL

Species concentrations

1. JSPARSE & FSPARSE showed good agreement for values of O3, NO2, NO, H2O2
2. Degraded agreement for species such as ASO4I
3. Remaining differences result from cumulative errors in EPA code.

Conclusions

• CMAQ computational performance shows speedup in the range 1.4-1.5 with two compilers on two platforms in a thread parallel model for the Rosenbrock solver when compared to the U.S. EPA release
• The FSPARSE algorithm yields more precision in a sparse matrix chemistry solver when compared to the U.S. EPA release
• The FSPARSE algorithm offers performance gains that are portable across platforms and compilers

Next steps for CMAQ development

Short term goals

• OpenMP parallel model extensions to other code portions of CMAQ
• Explore port of FSPARSE to GPGPU technology

Long term goals

• Plan for code architecture (re)design throughout the whole of CMAQ to change the memory footprint & increase computational efficiency
• Develop thread safe version of CMAQ with the Gear solver


Extra Slides

Chemistry solver time step count (pgf90 on Intel node, 1hr run)

[Figure, left panel: CMAQ time step count for EPA and parallel (single thread) versions with ATOL=1E-07. x-axis: simulation time (minute), 0 to 60; y-axis: iteration count (10**5), 1.0 to 3.5. Series: EPA, OMP1, RATIO.]

[Figure, right panel: CMAQ time step count for parallel (single thread) version with ATOL=1E-05. x-axis: simulation time (minute), 0 to 60; y-axis: iteration count (10**5), 0.5 to 3.0. Series: EPA, OMP1, RATIO.]

Chemistry solver scaling & speedup (pgf90 on Intel node, 1hr run)

[Figure, left panel: Parallel CMAQ scaling by thread count versus single thread with ATOL=1E-05. x-axis: simulation time (minute), 0 to 60; y-axis: iteration scaling vs 1 thread, 1.0 to 6.0. Series: OMP2, OMP4, OMP6, OMP8.]

[Figure, right panel: Parallel CMAQ speedup by thread count (1-8) versus EPA. x-axis: simulation time (minute), 0 to 60; y-axis: iteration thread speedup vs EPA, 0.5 to 4.0. Series: OMP1, OMP2, OMP4, OMP6, OMP8.]

Concentrations for ASO4I at hour 23

Histogram of all 66,960 concentration values of Layer 1 in decade bins: difference in predictions ( ■ ) and concentration value ( ■ )

[Figure: x-axis: value at bin upper boundary, 1.E-07 to 1.E-01; y-axis: ASO4I frequency (percent of all counts), log scale up to 100.00. Series: Concentration value; Difference = EPA - OMP1.]

Parallel paradigm nomenclature

Parallel paradigms

• MPI = Message Passing Interface (coarse grain chunks of work)
• OpenMP = a thread based model (fine grain chunks of work)
• Vector/SSE = instruction level (really fine grain tasks)
• GPGPU = General Purpose Graphics Processing Unit (multi-grain tasks)

Bandwidth increases & latency decreases

Software Evolution

• Compiler technology has grown
• CMAQ software development for computational efficiency is lagging
• CMAQ users need more throughput as problem size grows
• Penalty for not adapting to growth:
    • Lost performance (more than 10x)
    • Decrease in efficiency & throughput

Riding the revolution

• HPC Mantra
    • “Map the model to the architecture”
• Shared Memory Parallel model
    • OpenMP port with up to 24 threads
    • GPGPU port with up to hundreds of threads
• Decision points
    • Assessing the level-of-effort to adapt
    • Blending with existing MPI models

CMAQ has not kept up-to-date with HPC growth

Why?

• Architecture has evolved rapidly to support multiple levels of parallelism
• CMAQ traditionally uses only one level of parallelism
• Model development has effectively moved CMAQ work load balance in the direction of more scalar work

Parallel CMAQ approach (old parallel school: 1980’s)

• Data parallelism
    • Partition data domain (i.e. grid)
    • Distribute partitions to cluster nodes
• Apply MPI
    • To distribute coarse work chunks
    • Co-ordinate synchronization & data collection
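The data-parallel partitioning step can be sketched in a few lines. This is an illustration only: the helper names and the 4 x 2 process layout are hypothetical, no MPI calls are made, and it is not the decomposition code used in CMAQ. It only shows how a grid is cut into contiguous subdomains, one per process.

```python
def split_1d(n, parts):
    """Split range(n) into `parts` contiguous (start, stop) chunks of near-equal size."""
    base, extra = divmod(n, parts)
    bounds, start = [], 0
    for i in range(parts):
        stop = start + base + (1 if i < extra else 0)
        bounds.append((start, stop))
        start = stop
    return bounds


def decompose(cols, rows, pcols, prows):
    """Return one (column range, row range) subdomain per process."""
    return [(c, r) for r in split_1d(rows, prows) for c in split_1d(cols, pcols)]


# 279 x 240 grid from the episode slide; the 4 x 2 process layout is hypothetical.
subdomains = decompose(279, 240, 4, 2)

for rank, ((c0, c1), (r0, r1)) in enumerate(subdomains):
    print(f"rank {rank}: columns {c0}-{c1 - 1}, rows {r0}-{r1 - 1}")
```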

Proposed parallel CMAQ approach (new parallel school: 2000’s)

• Task parallelism (OpenMP)
    • Distribute tasks to parallel thread teams
    • Utilize separate cores (one per thread)
• Instruction level parallelism (Vector)
    • Construct code that vectorizes
    • Utilize vector instructions on commodity processors
• Target same code to GPGPU
    • All instruction-level parallel loops also parallelize for a GPGPU target