
  • LBM BASED FLOW SIMULATION USING GPU COMPUTING PROCESSOR

    CETHIL UMR 5008

    Frédéric Kuznik, [email protected]

    LBE Workshop - 15/05/2008 - CEMAGREF Antony

  • Framework

    Introduction

    Hardware architecture

    CUDA overview

    Implementation details

    A simple case: the lid driven cavity problem


  • Presentation of my activities

    CETHIL: Thermal Sciences Center of Lyon

    Current work: simulation of fluid flows with heat transfer


    Use/development of thermal LBM [1,2]; LBM computation using GPU [3]

    Objective: coupling the two approaches efficiently

    [1] A double population lattice Boltzmann method with non-uniform mesh for the simulation of natural convection in a square cavity, Int. J. Heat and Fluid Flow, vol. 28(5), pp. 862-870, 2007

    [2] Numerical prediction of natural convection occurring in building components: a double population lattice Boltzmann method, Numerical Heat Transfer A, vol. 52(4), pp. 315-335, 2007

    [3] LBM based flow simulation using GPU computing processor, Computers & Mathematics with Applications, under review

  • Introduction

    Driven by the market (real-time, high-definition 3-D graphics), the GPU has evolved into a highly parallel, multithreaded, multi-core processor.

    (from CUDA Programming Guide, 06/07/2008)

    GFLOPS = 10⁹ floating-point operations per second

  • Introduction

    The memory bandwidth of the GPU has also evolved for faster graphics purposes.

    (from CUDA Programming Guide, 06/07/2008)

  • Previous works using GPU & LBM

    GPU Cluster for High Performance Computing (Fan et al. 2004): 32-node cluster of GeForce 5800 Ultra cards

    LB-Stream Computing or over 1 Billion Lattice Updates per Second on a Single PC (J. Tölke, ICMMES 2007): GeForce 8800 Ultra

  • Hardware

    Commercial graphics card (can be included in a desktop PC)

    240 processors running at 1.35 GHz

    1 GB GDDR3 with 141.7 GB/s bandwidth

    Theoretical peak at 1000 GFLOPS

    Price: around 500 €

    nVIDIA GeForce GTX 280

  • GPU Hardware Architecture

    ◊ 15 texture processor clusters (TPCs)

    ◊ 1 TPC is composed of 2 streaming multiprocessors (SMs)

    ◊ 1 SM is composed of 8 streaming processors (SPs) and 2 special function units

    ◊ Each SM has 16 kB of shared memory

  • GPU Programming Interface

    NVIDIA® CUDA™ C language programming environment (version 2.0)

    Definitions:

    Device = GPU

    Host = CPU

    Kernel = function that is called from the host and runs on the device

    A CUDA kernel is executed by an array of threads
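    As a minimal illustration (not from the talk; the kernel name and sizes are hypothetical), a CUDA kernel and its host-side launch look like this:

        // Kernel: runs on the device, one thread per array element
        __global__ void scale(float *data, float factor, int n)
        {
            int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
            if (i < n)                                      // guard the last block
                data[i] *= factor;
        }

        // Host: launch enough 256-thread blocks to cover n elements
        // (d_data is a device pointer obtained from cudaMalloc)
        scale<<<(n + 255) / 256, 256>>>(d_data, 2.0f, n);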

  • GPU Programming Interface

    A kernel is executed by a grid of thread blocks

    1 thread block is executed by 1 multiprocessor

    A thread block contains a maximum of 1024 threads

    Threads are executed by processors within a single multiprocessor

    => Threads from different blocks cannot cooperate
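    A sketch of such a launch configuration, assuming a hypothetical kernel "step" and a 1024 x 1024 lattice:

        dim3 block(16, 16);                         // 256 threads per block
        dim3 grid(1024 / block.x, 1024 / block.y);  // grid of blocks covering the lattice
        step<<<grid, block>>>(/* arguments */);     // each block runs on one multiprocessor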

  • Kernel memory access

    Registers

    Shared memory: on-chip (fast), small (16 kB)

    Global memory: off-chip, large

    The host can read or write global memory only.
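    A hedged sketch of how the three memory spaces appear in CUDA C (identifiers are hypothetical; assumes 256-thread blocks):

        __global__ void stage_copy(const float *g_in, float *g_out)  // g_*: global memory
        {
            __shared__ float s_buf[256];                    // shared memory: on-chip, per block
            int i = blockIdx.x * blockDim.x + threadIdx.x;  // i and v live in registers
            float v = g_in[i];                              // read from global memory
            s_buf[threadIdx.x] = v;                         // stage in shared memory
            __syncthreads();                                // visible to the whole block
            g_out[i] = s_buf[threadIdx.x];                  // only global memory is host-visible
        }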

  • LBM ALGORITHM

    Step 1: Collision (local – 70% of computational time)

    Step 2: Propagation (neighbors – 28% of computational time)

    + boundary conditions

    (from BERNSDORF J., How to make my LB-code faster – software planning, implementation and performance tuning, ICMMES'08, Netherlands)
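    For reference (standard notation, not spelled out on the slide), the two steps are the lattice Boltzmann update; a BGK collision operator is written here for compactness, although the talk uses an MRT model:

        f_i^{\star}(\mathbf{x}, t) = f_i(\mathbf{x}, t) - \frac{1}{\tau}\left[ f_i(\mathbf{x}, t) - f_i^{\mathrm{eq}}(\mathbf{x}, t) \right]   % Step 1: collision (local)

        f_i(\mathbf{x} + \mathbf{e}_i \,\Delta t,\; t + \Delta t) = f_i^{\star}(\mathbf{x}, t)   % Step 2: propagation (neighbors)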

  • Implementation details

    Translation of the lattice grid into CUDA topology (lattice grid -> grid of thread blocks), as sketched after this list:

    1- Decomposition of the fluid domain into a lattice grid

    2- Indexing the lattice nodes using thread ID

    3- Decomposition of the CUDA domain into thread blocks

    4- Execution of the kernel by the thread blocks
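    A minimal sketch of step 2 (indexing lattice nodes from thread and block IDs); NX, NY and the storage layout are assumptions, not the talk's actual code:

        int x = blockIdx.x * blockDim.x + threadIdx.x;  // lattice column from thread ID
        int y = blockIdx.y * blockDim.y + threadIdx.y;  // lattice row from thread ID
        int node = y * NX + x;                          // linear lattice-node index
        // f[q * NX * NY + node] then addresses distribution q at this node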

  • Pseudo-code

    Combine collision and propagation steps:

    for each thread block
        for each thread
            load f_i into shared memory
            compute collision step
            do the propagation step
        end
    end

    exchange information across boundaries
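    A hedged sketch of what such a fused kernel could look like for D2Q9 (all names are hypothetical; a BGK collision replaces the talk's MRT operator for brevity, distributions are kept in registers rather than staged in shared memory, and boundary handling is omitted):

        __constant__ float ex[9] = { 0, 1, 0,-1, 0, 1,-1,-1, 1 };  // D2Q9 lattice velocities
        __constant__ float ey[9] = { 0, 0, 1, 0,-1, 1, 1,-1,-1 };
        __constant__ float w[9]  = { 4.f/9,  1.f/9,  1.f/9,  1.f/9,  1.f/9,
                                     1.f/36, 1.f/36, 1.f/36, 1.f/36 };  // weights

        __global__ void collide_and_stream(const float *f_in, float *f_out,
                                           int nx, int ny, float omega)
        {
            int x = blockIdx.x * blockDim.x + threadIdx.x;
            int y = blockIdx.y * blockDim.y + threadIdx.y;
            int n = y * nx + x;

            // Load the 9 distributions of this node and accumulate the moments
            float f[9], rho = 0.f, ux = 0.f, uy = 0.f;
            for (int q = 0; q < 9; ++q) {
                f[q] = f_in[q * nx * ny + n];
                rho += f[q];
                ux  += f[q] * ex[q];
                uy  += f[q] * ey[q];
            }
            ux /= rho;
            uy /= rho;

            // Collision (local) fused with propagation (write to neighbors)
            for (int q = 0; q < 9; ++q) {
                float eu  = ex[q] * ux + ey[q] * uy;
                float feq = w[q] * rho * (1.f + 3.f * eu + 4.5f * eu * eu
                                          - 1.5f * (ux * ux + uy * uy));
                f[q] += omega * (feq - f[q]);
                int xn = x + (int)ex[q], yn = y + (int)ey[q];
                if (xn >= 0 && xn < nx && yn >= 0 && yn < ny)
                    f_out[q * nx * ny + yn * nx + xn] = f[q];
            }
        }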

  • Application: 2D lid driven cavity

    Problem description: u = U0, v = 0 on the moving (top) lid; u = 0, v = 0 on the three other walls

    D2Q9 MRT model of d'Humières (1992)

  • Results Re=1000

    Reference data from Erturk et al. 2005

    [Figures: v velocity vs. x and u velocity vs. y, LBM compared with data from Erturk et al.; vertical velocity profile for x = 0.5, horizontal velocity profile for y = 0.5]

  • Performances

    Performances in MLUPS (million lattice updates per second) by mesh grid size and number of threads per block:

    Grid size |    16     32     64    128    256
    ----------+------------------------------------
        256²  |  71.0   71.0   71.1   71.1   70.9
        512²  | 284.2  282.7  284.3  284.3  283.7
       1024²  | 346.4  690.2  953.0  997.5 1007.9
       2048²  | 240.0  583.0  803.3  961.3 1001.5
       3072²  | 289.3  572.4  845.2  941.2  974.2
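    As a reminder (standard definition, not given on the slide), for an N x N grid advanced T time steps in t seconds of wall-clock time:

        \mathrm{MLUPS} = \frac{N^{2} \, T}{10^{6} \, t}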

  • Performances: CPU comparisons

    [Figure: GPU/CPU speed-up vs. number of threads per block (0 to 256), for mesh grid sizes 256², 512², 1024², 2048², and 3072²]

    CPU = Pentium IV, 3.0 GHz

  • Conclusions

    The GPU allows a gain of about 180 in calculation time

    Multiple-GPU servers exist: the NVIDIA TESLA S1070, composed of 4 GTX 280 (theoretical gain: 720)

    The evolution of GPU is not finished!

    But using double precision floating point, the gain falls to 20!

  • Outlooks

    A workgroup concerning LBM and GPU:

    1 master student with Prof. Bernard TOURANCHEAU (ENS Lyon - INRIA) and a PhD student in 2009

    1 master student with Prof. Eric Favier (ENISE - DIPI)

    A workgroup concerning thermal LBM...?

    Everybody is welcome to join the workgroup!

    [Figures: natural convection flow fields for Ra = 10³, 10⁴, 10⁵, and 10⁶]

  • THANK YOU FOR YOUR ATTENTION

    [email protected]