TRANSCRIPT
Scalable System for Large Unstructured Mesh Simulation
Miguel A. Pasenau, Pooyan Dadvand, Jordi Cotela, Abel Coll and Eugenio Oñate
Overview
• Introduction
• Preparation and Simulation
– More Efficient Partitioning
– Parallel Element Splitting
• Post Processing
– Results Cache
– Merging Many Partitions
– Memory usage
– Off-screen mode
• Conclusions, Future lines and Acknowledgements
16th – 20th May 2011 / 2
Introduction
• Goal: run a CFD simulation with 100 Million elements using in-house tools
• Hardware: cluster with
– Master node: 2 x Intel Quad Core E5410 and 32 GB RAM
– 3 TB disc with a dedicated Gigabit link to the Master node
– 10 nodes: 2 x Intel Quad Core E5410 and 16 GB RAM
– 2 nodes: 2 x AMD Opteron Quad Core 2356 and 32 GB RAM
– Infiniband 4x DDR
Introduction
• Kratos:
– Multi-physics, open source framework
– Parallelized for shared and distributed memory machines
• GiD:
– Geometry handling and data management
– First coarse mesh
– Joining and post-processing results
Introduction
[Workflow diagram: Geometry, Conditions and Materials → coarse mesh generation → Partition → Distribution (with a Communication plan) → parallel Refinement → Calculation → per-partition results (part 1 / res. 1 ... part n / res. n) → Merge → Visualize]
Meshing
• Single workstation: limited memory and time
• Three steps:
– Single node: GiD generates a coarse mesh with 13 Million tetrahedrons
– Single node: Kratos + Metis divide and distribute the mesh
– In parallel: Kratos refines the mesh locally
Efficient partitioning: before
Rank 0 reads the model, partitions it and sends the partitions to the other ranks.
[Diagram: Rank 0 distributing partitions to Ranks 1, 2 and 3]
Efficient partitioning: before
• Requires large memory in node 0
• Uses cluster time for partitioning, which could be done outside
• Each rerun needs repartitioning
• Same working procedure for OpenMP and MPI runs
Efficient partitioning: now
Dividing and writing the partitions is done on another machine; each rank then reads its own data separately.
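The "partition offline, read per rank" scheme above can be sketched as follows. This is an illustration, not the Kratos/Metis implementation: the function names are hypothetical, and a round-robin assignment stands in for the real graph partitioner.

```python
import json
import os

def write_partitions(elements, n_ranks, out_dir):
    """Offline step (on another machine): split the element list into one
    file per rank, so no single cluster node must hold the whole model.
    Round-robin here is only a stand-in for a Metis-style partitioner."""
    parts = {r: [] for r in range(n_ranks)}
    for i, elem in enumerate(elements):
        parts[i % n_ranks].append(elem)
    for r, elems in parts.items():
        with open(os.path.join(out_dir, f"part_{r}.json"), "w") as f:
            json.dump(elems, f)

def read_partition(rank, out_dir):
    """Cluster step: each MPI rank opens only its own partition file."""
    with open(os.path.join(out_dir, f"part_{rank}.json")) as f:
        return json.load(f)
```

Because each rank touches only its own file, reruns reuse the stored partitions and rank 0 no longer needs memory for the full model.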
Local refinement: triangle
[Diagram: refinement cases for triangle (i, j, k) with edge midpoints l, m, n, producing 2, 3 or 4 child triangles depending on which edges are marked]
Local refinement: triangle
• Selecting the case respecting node Ids
• The decision is not made for best quality!
• It is very good for parallelization: OpenMP and MPI
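The id-based case selection can be made concrete with a small sketch. This is illustrative, not the Kratos code: when two edges of triangle (i, j, k) are refined, the interior diagonal is chosen from global node ids alone, so every process splits a shared triangle identically without any communication.

```python
def split_two_edges(i, j, k, m_ij, m_jk):
    """Split triangle (i, j, k) whose edges (i, j) and (j, k) carry
    midpoint nodes m_ij and m_jk. The corner triangle at j is fixed;
    the remaining quad (i, m_ij, m_jk, k) is cut along the diagonal
    starting at the quad corner with the smallest global node id.
    This rule is deterministic, not quality-driven."""
    children = [(m_ij, j, m_jk)]            # corner triangle at j
    if min(i, m_jk) < min(m_ij, k):         # diagonal i - m_jk
        children += [(i, m_ij, m_jk), (i, m_jk, k)]
    else:                                   # diagonal m_ij - k
        children += [(i, m_ij, k), (m_ij, m_jk, k)]
    return children
```

Since the rule depends only on the node ids, two MPI ranks that both own a copy of the same interface triangle produce the same children, which is what makes the scheme parallel-friendly.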
Local refinement: tetrahedron
[Diagram: father element and its child elements]
Local refinement: examples
[Images of locally refined meshes]
Local refinement: uniform
• A uniform refinement can be used to obtain a mesh with 8 times more elements
• It does not improve the geometry representation
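A back-of-envelope check ties this factor to the numbers elsewhere in the talk, assuming one uniform refinement level of the 13 Million tetrahedron coarse mesh from the Meshing slide:

```python
# One uniform refinement level: each tetrahedron splits into 8 children.
coarse_tets = 13_000_000
children_per_tet = 8
refined_tets = coarse_tets * children_per_tet   # 104,000,000
```

That estimate is of the same order as the 103 Million tetrahedrons actually simulated.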
Parallel calculation
• Calculated using 12 x 8 MPI processes
• Less than 1 day for 400 time steps
• About 180 GB memory usage
• Single volume mesh of 103 Million tetrahedrons split into 96 files (each with a mesh portion and its results)
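The file count follows directly from the process layout, assuming eight processes per node (two quad-core CPUs) on the twelve compute nodes listed in the Introduction:

```python
# 12 x 8 MPI processes -> 96 output files, one mesh portion + results each.
nodes = 12                 # 10 Intel + 2 AMD nodes from the cluster slide
processes_per_node = 8     # two quad-core CPUs per node
mpi_processes = nodes * processes_per_node
```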
Post-process
• Challenges to face:
– Single node
– Big files: tens or hundreds of GB
– Lots of files
– Batch post-processing
– Maintain generality
– Fast drawing: display lists, vertex arrays, buffer objects, textures
– Nice drawing: shadows, 3D
Fast drawing
• OpenGL techniques:
– Display lists: compiled list of commands, stored in graphics memory
– Vertex arrays: compact data, fewer calls
– Buffer objects: arrays stored in graphics memory
– Textures
Fast drawing
• Useful with graphics hardware
Intel QuadCore Q9550 + NVIDIA GTX 275 (896 MB)
Fast drawing
• Useful with modest graphics hardware
Intel QuadCore Q9550 + Intel G45 (shared memory)
Fast drawing
• Useful with no graphics hardware
Intel QuadCore Q9550 + Software Mesa 3D GL
Nice drawing: stereoscopic mode (3D)
Nice drawing: shadows
Shadows add realism and depth perception of floating objects.
Nice drawings: 3D + shadows
Nice drawings: shadows
Big Files: results cache
• Uses a defined memory pool to store results
• Used to cache results stored in files
[Diagram: user-definable memory pool holding results read from files (single, multiple, merge) and created results (temporal information, cuts, extrusions, Tcl results)]
Big Files: results cache
[Diagram: results cache table of RC entries, each with a timestamp and an RC info record listing, per results file, the offset and type of the result plus its memory footprint; an open-files table keeps the file handles; the granularity of an entry is a single result]
Big Files: results cache
• Verifies the result's file(s) and gets the result's position in the file and its memory footprint
• Results of the latest analysis step stay in memory
• Loaded on demand
• Oldest results unloaded if needed
• Touched on use
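The load-on-demand / evict-oldest / touch-on-use policy just listed can be sketched with an LRU cache bounded by a byte budget. This is an illustration of the policy, not the GiD implementation; all names are hypothetical.

```python
from collections import OrderedDict

class ResultsCache:
    """Sketch of a results cache with a fixed memory pool."""

    def __init__(self, budget_bytes):
        self.budget = budget_bytes
        self.used = 0
        self.entries = OrderedDict()   # key -> (footprint, payload)

    def get(self, key, load, footprint):
        if key in self.entries:
            self.entries.move_to_end(key)      # touch on use
            return self.entries[key][1]
        # unload oldest results until the new one fits (a single result
        # larger than the whole budget is still loaded, over budget)
        while self.used + footprint > self.budget and self.entries:
            _, (fp, _) = self.entries.popitem(last=False)
            self.used -= fp
        payload = load()                        # load on demand
        self.entries[key] = (footprint, payload)
        self.used += footprint
        return payload
```

With a 2 GB budget, for example, only the most recently touched results stay resident while older time steps are dropped back to their files.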
Big Files: results cache
• Chinese harbour example:
– 104 GB results file
– 7.6 Million tetrahedrons
– 2,292 time steps
– With a 2 GB results cache: 3.16 GB memory usage
Merging many partitions
• Before: 2, 4, ... 10 partitions
• Now: 32, 64, 128, ... partitions of a single volume mesh
• Postpone any calculation:
– Skin extraction
– Finding boundary edges
– Smoothed normals
– Neighbour information
– Graphical objects creation
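The "postpone any calculation" idea amounts to lazy evaluation: derived data is computed only on first access, so merging many partition files does no per-partition post-processing up front. A minimal sketch, with hypothetical names and skin extraction as the example derived quantity:

```python
from collections import Counter
from functools import cached_property

class MergedMesh:
    """Merged volume mesh whose derived data is computed lazily."""

    def __init__(self, tets):
        # merging only concatenates connectivity; nothing is derived yet
        self.tets = list(tets)

    @cached_property
    def skin(self):
        """Boundary (skin) faces: faces referenced by exactly one
        tetrahedron. Computed on first access, then cached."""
        faces = Counter()
        for a, b, c, d in self.tets:
            for f in ((a, b, c), (a, b, d), (a, c, d), (b, c, d)):
                faces[tuple(sorted(f))] += 1
        return [f for f, n in faces.items() if n == 1]
```

Deferring this work is what turns the merge itself into mostly file concatenation, which matches the large speedups reported on the next slides.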
Merging many partitions
Telescope example: 23,870,544 tetrahedrons
Before: 32 partitions, 24' 10"
After: 32 partitions, 4' 34"
After: 128 partitions, 10' 43"
After: single file, 2' 16"
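Converting the 32-partition timings above to seconds shows the size of the gain:

```python
# Telescope example, 32 partitions, times in seconds.
before_32 = 24 * 60 + 10   # 24' 10" with the old merge path
after_32 = 4 * 60 + 34     # 4' 34" with postponed calculations
speedup = before_32 / after_32
```

That is roughly a 5x speedup for the same partition count.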
Merging many partitions
Merging many partitions
Racing car example: 103,671,344 tetrahedrons
Before: 96 partitions, > 5 hours
After: 96 partitions, 51' 21"
After: single file, 13' 25"
Memory usage
• Around 12 GB of memory used, with a spike of 15 GB (MS Windows) or 17.5 GB (Linux), including:
– Volume mesh (103 Mtetras)
– Skin mesh (6 Mtriangs)
– Several surface and cut meshes
– Stream line search tree
– 2 GB of results cache
– Animations
Pictures
[Images of post-processed simulation results]
Batch post-processing: off-screen
• GiD with no interaction and no window
• Command line:
gid -offscreen [WxH] -b+g batch_file_to_run
• Useful to:
– launch costly animations in the background or in a queue
– use GiD as a template generator
– use GiD behind a web server: Flash Video animation
• Animation window: a button was added to generate the batch file for off-screen GiD, to be sent to a batch queue
Animation
Conclusions
• The implemented improvements helped us achieve the milestone: prepare, mesh, calculate and visualize a CFD simulation with 103 Million tetrahedrons
• GiD: modest machines also profit from these improvements
Future lines
• Faster tree creation for stream lines
– Now: ~90 s creation time, 2-3 s per stream line
• Mesh simplification, LOD
– geometry and results criteria
– Surface meshes, iso-surfaces, cuts: faster drawing
– Volume meshes: faster cuts, stream lines
– Near real-time
• Parallelize other algorithms in GiD:
– Skin and boundary edges extraction
– Parallel cuts and stream lines creation
Challenges
• 10⁹ - 10¹⁰ tetrahedrons, 6·10⁸ - 6·10⁹ triangles
• Large workstation with Infiniband to the cluster, and 80 GB or 800 GB RAM? Hard disk?
• Post-process as the backend of a web server in the cluster? Security issues?
• Post-process embedded in the solver?
• Output of both the original mesh and a simplified one?
Acknowledgements
• Ministerio de Ciencia e Innovación, E-DAMS project
• European Commission, Real-time project
Comments, questions ... ?
Thanks for your attention
Scalable System for Large Unstructured Mesh Simulation