TRANSCRIPT
Scalable System for Large Unstructured Mesh Simulation
Miguel A. Pasenau, Pooyan Dadvand, Jordi Cotela, Abel Coll and Eugenio Oñate
Overview
• Introduction
• Preparation and Simulation
– More Efficient Partitioning
– Parallel Element Splitting
• Post Processing
– Results Cache
– Merging Many Partitions
– Memory usage
– Off-screen mode
• Conclusions, Future lines and Acknowledgements
16th – 20th May 2011 / 2
Introduction
• Goal: run a CFD simulation with 100 Million elements using in-house tools
• Hardware: cluster with
– Master node: 2 x Intel Quad Core E5410 and 32 GB RAM
– 3 TB disc with a dedicated Gigabit link to the Master node
– 10 nodes: 2 x Intel Quad Core E5410 and 16 GB RAM
– 2 nodes: 2 x AMD Opteron Quad Core 2356 and 32 GB RAM
– Infiniband 4x DDR
Introduction
• Kratos:
– Multi-physics, open source framework
– Parallelized for shared and distributed memory machines
• GiD:
– Geometry handling and data management
– First coarse mesh
– Joining and post-processing results
Introduction
[Workflow diagram: Geometry, Conditions and Materials → coarse mesh generation → Partition → Distribution (with a Communication plan) → parallel Refinement → Calculation → per-partition results (part 1 / res. 1 ... part n / res. n) → Merge → Visualize]
Meshing
• Single workstation: limited memory and time
• Three steps:
– Single node: GiD generates a coarse mesh with 13 Million tetrahedrons
– Single node: Kratos + Metis divide and distribute the mesh
– In parallel: Kratos refines the mesh locally
Efficient partitioning: before
Rank 0 reads the model, partitions it and sends the partitions to the other ranks.
[Diagram: Rank 0 distributing partitions to Ranks 1, 2 and 3]
Efficient partitioning: before
• Requires large memory in node 0
• Uses cluster time for partitioning, which could be done outside
• Each rerun needs repartitioning
• Same working procedure for OpenMP and MPI runs
Efficient partitioning: now
Dividing and writing the partitions is done on another machine; each rank then reads its own data separately.
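The "partition offline, read per rank" scheme above can be sketched as follows. This is an illustration, not the Kratos/Metis implementation: the function names are hypothetical, and a round-robin assignment stands in for the real graph partitioner.

```python
import json
import os

def write_partitions(elements, n_ranks, out_dir):
    """Offline step (on another machine): split the element list into one
    file per rank, so no single cluster node must hold the whole model.
    Round-robin here is only a stand-in for a Metis-style partitioner."""
    parts = {r: [] for r in range(n_ranks)}
    for i, elem in enumerate(elements):
        parts[i % n_ranks].append(elem)
    for r, elems in parts.items():
        with open(os.path.join(out_dir, f"part_{r}.json"), "w") as f:
            json.dump(elems, f)

def read_partition(rank, out_dir):
    """Cluster step: each MPI rank opens only its own partition file."""
    with open(os.path.join(out_dir, f"part_{rank}.json")) as f:
        return json.load(f)
```

Because each rank touches only its own file, reruns reuse the stored partitions and rank 0 no longer needs memory for the full model.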
Local refinement: triangle
[Diagram: refinement cases for triangle (i, j, k) with edge midpoints l, m, n, producing 2, 3 or 4 child triangles depending on which edges are marked]
Local refinement: triangle
• Selecting the case respecting node Ids
• The decision is not made for best quality!
• It is very good for parallelization: OpenMP and MPI
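The id-based case selection can be made concrete with a small sketch. This is illustrative, not the Kratos code: when two edges of triangle (i, j, k) are refined, the interior diagonal is chosen from global node ids alone, so every process splits a shared triangle identically without any communication.

```python
def split_two_edges(i, j, k, m_ij, m_jk):
    """Split triangle (i, j, k) whose edges (i, j) and (j, k) carry
    midpoint nodes m_ij and m_jk. The corner triangle at j is fixed;
    the remaining quad (i, m_ij, m_jk, k) is cut along the diagonal
    starting at the quad corner with the smallest global node id.
    This rule is deterministic, not quality-driven."""
    children = [(m_ij, j, m_jk)]            # corner triangle at j
    if min(i, m_jk) < min(m_ij, k):         # diagonal i - m_jk
        children += [(i, m_ij, m_jk), (i, m_jk, k)]
    else:                                   # diagonal m_ij - k
        children += [(i, m_ij, k), (m_ij, m_jk, k)]
    return children
```

Since the rule depends only on the node ids, two MPI ranks that both own a copy of the same interface triangle produce the same children, which is what makes the scheme parallel-friendly.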
Local refinement: tetrahedron
[Diagram: father element and its child elements]
Local refinement: examples
[Images of locally refined meshes]
Local refinement: uniform
• A uniform refinement can be used to obtain a mesh with 8 times more elements
• It does not improve the geometry representation
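A back-of-envelope check ties this factor to the numbers elsewhere in the talk, assuming one uniform refinement level of the 13 Million tetrahedron coarse mesh from the Meshing slide:

```python
# One uniform refinement level: each tetrahedron splits into 8 children.
coarse_tets = 13_000_000
children_per_tet = 8
refined_tets = coarse_tets * children_per_tet   # 104,000,000
```

That estimate is of the same order as the 103 Million tetrahedrons actually simulated.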
Parallel calculation
• Calculated using 12 x 8 MPI processes
• Less than 1 day for 400 time steps
• About 180 GB memory usage
• Single volume mesh of 103 Million tetrahedrons split into 96 files (each with a mesh portion and its results)
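The file count follows directly from the process layout, assuming eight processes per node (two quad-core CPUs) on the twelve compute nodes listed in the Introduction:

```python
# 12 x 8 MPI processes -> 96 output files, one mesh portion + results each.
nodes = 12                 # 10 Intel + 2 AMD nodes from the cluster slide
processes_per_node = 8     # two quad-core CPUs per node
mpi_processes = nodes * processes_per_node
```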
Post-process
• Challenges to face:
– Single node
– Big files: tens or hundreds of GB
– Lots of files
– Batch post-processing
– Maintain generality
– Fast drawing: display lists, vertex arrays, buffer objects, textures
– Nice drawing: shadows, 3D
Fast drawing
• OpenGL techniques:
– Display lists: compiled list of commands, stored in graphics memory
– Vertex arrays: compact data, fewer calls
– Buffer objects: arrays stored in graphics memory
– Textures
Fast drawing
• Useful with graphics hardware
Intel QuadCore Q9550 + NVIDIA GTX 275 (896 MB)
Fast drawing
• Useful with modest graphics hardware
Intel QuadCore Q9550 + Intel G45 (shared memory)
Fast drawing
• Useful with no graphics hardware
Intel QuadCore Q9550 + Software Mesa 3D GL
Nice drawing: stereoscopic mode (3D)
Nice drawing: shadows
Shadows add realism and depth perception of floating objects.
Nice drawings: 3D + shadows
Nice drawings: shadows
Big Files: results cache
• Uses a defined memory pool to store results
• Used to cache results stored in files
[Diagram: user-definable memory pool holding results read from files (single, multiple, merge) and created results (temporal information, cuts, extrusions, Tcl results)]
Big Files: results cache
[Diagram: results cache table of RC entries, each with a timestamp and an RC info record listing, per results file, the offset and type of the result plus its memory footprint; an open-files table keeps the file handles; the granularity of an entry is a single result]
Big Files: results cache
• Verifies the result's file(s) and gets the result's position in the file and its memory footprint
• Results of the latest analysis step stay in memory
• Loaded on demand
• Oldest results unloaded if needed
• Touched on use
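The load-on-demand / evict-oldest / touch-on-use policy just listed can be sketched with an LRU cache bounded by a byte budget. This is an illustration of the policy, not the GiD implementation; all names are hypothetical.

```python
from collections import OrderedDict

class ResultsCache:
    """Sketch of a results cache with a fixed memory pool."""

    def __init__(self, budget_bytes):
        self.budget = budget_bytes
        self.used = 0
        self.entries = OrderedDict()   # key -> (footprint, payload)

    def get(self, key, load, footprint):
        if key in self.entries:
            self.entries.move_to_end(key)      # touch on use
            return self.entries[key][1]
        # unload oldest results until the new one fits (a single result
        # larger than the whole budget is still loaded, over budget)
        while self.used + footprint > self.budget and self.entries:
            _, (fp, _) = self.entries.popitem(last=False)
            self.used -= fp
        payload = load()                        # load on demand
        self.entries[key] = (footprint, payload)
        self.used += footprint
        return payload
```

With a 2 GB budget, for example, only the most recently touched results stay resident while older time steps are dropped back to their files.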
Big Files: results cache
• Chinese harbour example:
– 104 GB results file
– 7.6 Million tetrahedrons
– 2,292 time steps
– With a 2 GB results cache: 3.16 GB memory usage
Merging many partitions
• Before: 2, 4, ... 10 partitions
• Now: 32, 64, 128, ... partitions of a single volume mesh
• Postpone any calculation:
– Skin extraction
– Finding boundary edges
– Smoothed normals
– Neighbour information
– Graphical objects creation
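The "postpone any calculation" idea amounts to lazy evaluation: derived data is computed only on first access, so merging many partition files does no per-partition post-processing up front. A minimal sketch, with hypothetical names and skin extraction as the example derived quantity:

```python
from collections import Counter
from functools import cached_property

class MergedMesh:
    """Merged volume mesh whose derived data is computed lazily."""

    def __init__(self, tets):
        # merging only concatenates connectivity; nothing is derived yet
        self.tets = list(tets)

    @cached_property
    def skin(self):
        """Boundary (skin) faces: faces referenced by exactly one
        tetrahedron. Computed on first access, then cached."""
        faces = Counter()
        for a, b, c, d in self.tets:
            for f in ((a, b, c), (a, b, d), (a, c, d), (b, c, d)):
                faces[tuple(sorted(f))] += 1
        return [f for f, n in faces.items() if n == 1]
```

Deferring this work is what turns the merge itself into mostly file concatenation, which matches the large speedups reported on the next slides.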
Merging many partitions
Telescope example: 23,870,544 tetrahedrons
Before: 32 partitions, 24' 10"
After: 32 partitions, 4' 34"
After: 128 partitions, 10' 43"
After: single file, 2' 16"
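Converting the 32-partition timings above to seconds shows the size of the gain:

```python
# Telescope example, 32 partitions, times in seconds.
before_32 = 24 * 60 + 10   # 24' 10" with the old merge path
after_32 = 4 * 60 + 34     # 4' 34" with postponed calculations
speedup = before_32 / after_32
```

That is roughly a 5x speedup for the same partition count.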
Merging many partitions
Merging many partitions
Racing car example: 103,671,344 tetrahedrons
Before: 96 partitions, > 5 hours
After: 96 partitions, 51' 21"
After: single file, 13' 25"
Memory usage
• Around 12 GB of memory used, with a spike of 15 GB (MS Windows) or 17.5 GB (Linux), including:
– Volume mesh (103 Mtetras)
– Skin mesh (6 Mtriangs)
– Several surface and cut meshes
– Stream line search tree
– 2 GB of results cache
– Animations
Pictures
[Images of post-processed simulation results]
Batch post-processing: off-screen
• GiD with no interaction and no window
• Command line:
gid -offscreen [WxH] -b+g batch_file_to_run
• Useful to:
– launch costly animations in the background or in a queue
– use GiD as a template generator
– use GiD behind a web server: Flash Video animation
• Animation window: a button was added to generate the batch file for off-screen GiD, to be sent to a batch queue
Animation
Conclusions
• The implemented improvements helped us achieve the milestone: prepare, mesh, calculate and visualize a CFD simulation with 103 Million tetrahedrons
• GiD: modest machines also profit from these improvements
Future lines
• Faster tree creation for stream lines
– Now: ~90 s creation time, 2-3 s per stream line
• Mesh simplification, LOD
– geometry and results criteria
– Surface meshes, iso-surfaces, cuts: faster drawing
– Volume meshes: faster cuts, stream lines
– Near real-time
• Parallelize other algorithms in GiD:
– Skin and boundary edges extraction
– Parallel cuts and stream lines creation
Challenges
• 10⁹ - 10¹⁰ tetrahedrons, 6·10⁸ - 6·10⁹ triangles
• Large workstation with Infiniband to the cluster, and 80 GB or 800 GB RAM? Hard disk?
• Post-process as the backend of a web server in the cluster? Security issues?
• Post-process embedded in the solver?
• Output of both the original mesh and a simplified one?
Acknowledgements
• Ministerio de Ciencia e Innovación, E-DAMS project
• European Commission, Real-time project
Comments, questions ... ?
Thanks for your attention
Scalable System for Large Unstructured Mesh Simulation