Hybrid Architectures for Seismic Imaging – Bull, GPU Technology Conference 2012, San Jose


Guy Gueritz Oil & Gas Business Development

Mathieu Dubois Senior HPC Consultant

2 ©Bull, 2012 GPU Tech Conference 2012 – San Jose

1. Hybrid Architectures for Seismic Imaging

- BULL profile in HPC
- Hybrid architectures
- Example: Reverse Time Migration

2. Parallel Programming for Hybrid Architectures

- GPU activities at BULL: building an expertise
- Tools and programming environments
- Numerical methods
- Scalability

3 ©Bull, 2012 GPU Tech Conference 2012 – San Jose

1. Hybrid Architectures for Seismic Imaging

Guy GUERITZ

Oil & Gas Business Development

Grenoble Advanced Competency & Services Center

4 ©Bull, 2012 GPU Tech Conference 2012 – San Jose

[Bull group overview chart: business lines BCS, BIS, BSS and BIP; revenue figures of €1.35–1.45 billion and €1.2 billion, with direct margin, indirect costs and an EBIT of €50–60 million]

Shareholders:
- Crescendo Ind. 20%
- France Télécom 8%
- FSI 5%
- NEC 2%
- Floating 65%
- Total 100%

2011 figures:
- Revenue: €1,301 M (+4.6%)
- Gross margin: +4.2%
- EBIT: +23%
- Employees: 9,000

[Chart: revenue and profitability trajectory 2010 > 2013 across maintenance & PRS, services, hardware & systems, fulfillment and critical systems]

5 ©Bull, 2012 GPU Tech Conference 2012 – San Jose

[Chart: Bull HPC income in €M, excluding maintenance – 2007: 37, 2008: 70, 2009: 98, 2010: 152, 2011: 181]

More than €180M of income in 2011 (w/o maintenance)

Three petaflop-scale systems

- 2010: Tera 100, the first petaflop-scale system designed and developed in Europe, one of the most efficient in its category (84% Linpack efficiency)

- 2010-2011: GENCI / Curie (France) - 2 Pflops

- 2011-2012: IFERC - 1.5 Pflops

Other recent key projects
- KNMI (Netherlands): meteorology

- Barcelona Supercomputing Center (Spain): 186 Tflops (hybrid)

- Société Générale (France) : 350 Tflops

- Dassault Aviation (France) : 100 Tflops

- AWE (UK) : 250 Tflops

Launch of Extreme Factory (HPC pay per use) and Mobull (HPC mobile data center)

- Extreme Factory: Renault, Exa, LL Products, Classified

- Mobull: U_Perpignan, Cenaero

6 ©Bull, 2012 GPU Tech Conference 2012 – San Jose

Services: design, architecture, project management, optimisation

supercomputer suite, StoreWay storage

Hardware platforms, software environments, interconnect, storage systems

Built from standard components, optimized by Bull’s innovation

7 ©Bull, 2012 GPU Tech Conference 2012 – San Jose

- Structural mechanics (implicit)
- Structural mechanics (explicit)
- Computational fluid dynamics
- Electromagnetics
- Computational chemistry
- Quantum mechanics
- Reservoir simulation
- Rendering / ray tracing
- Climate / weather
- Ocean simulation
- Data analytics
- Molecular dynamics
- Computational biology
- Seismic processing

8 ©Bull, 2012 GPU Tech Conference 2012 – San Jose

TERA 100 – GPU-based extension
- 198 bullx B505 accelerator blades
- 396 NVIDIA® Tesla™ M2090 GPUs
- 202,752 GPU cores

CURIE – GPU-based extension
- 144 bullx B505 accelerator blades
- 288 NVIDIA® Tesla™ M2090 GPUs
- 147,456 GPU cores

Barcelona Supercomputing Centre – GPU-based system
- 126 bullx B505 accelerator blades
- 252 NVIDIA® Tesla™ M2090 GPUs
- 129,024 GPU cores

9 ©Bull, 2012 GPU Tech Conference 2012 – San Jose

Need: a supercomputing system
- to be installed at Petrobras' new data center, at the University Campus of Rio de Janeiro
- equipped with GPU accelerator technology
- dedicated to the development of new subsurface imaging techniques to support oil exploration and production

Solution: a hybrid architecture coupling 66 general-purpose servers to 66 GPU systems
- 66 bullx R422 E2 servers, i.e. 132 compute nodes or 1,056 Intel® Xeon® 5500 cores, providing a peak performance of 12.4 Tflops
- 66 NVIDIA® Tesla S1070 GPU systems, i.e. 63,360 cores, providing an additional theoretical performance of 246 Tflops
- 1 bullx R423 E2 service node
- Ultra-fast InfiniBand QDR interconnect
- bullx cluster suite and Red Hat Enterprise Linux

Petrobras: leader in the Brazilian petrochemical sector, and one of the largest integrated energy companies in the world

10 ©Bull, 2012 GPU Tech Conference 2012 – San Jose

11 ©Bull, 2012 GPU Tech Conference 2012 – San Jose

Source: exascale.org

12 ©Bull, 2012 GPU Tech Conference 2012 – San Jose

(Animation courtesy of the Institute of Geophysics in Hamburg)

13 ©Bull, 2012 GPU Tech Conference 2012 – San Jose

Forward Pass
• First recursion – forward in time
• Model the downgoing wavefield, store snapshots of the wavefield at set time intervals

Backward Pass
• Second recursion – reverse time
• Compute the backward extrapolation of wavefield snapshots, starting with the receiver data

Correlate Forward + Backward Snapshots
• Apply the imaging condition
• Correlate forward and backward samples together (a sketch follows)
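A minimal sketch of the zero-lag cross-correlation imaging condition described above, in CUDA C. The array names (fwd, bwd, image), launch configuration and grid size are illustrative assumptions, not the vendor code:

#include <cuda_runtime.h>

// Accumulate the zero-lag correlation of one pair of snapshots into the image.
__global__ void imaging_condition(const float *fwd,   // forward (source) snapshot
                                  const float *bwd,   // backward (receiver) snapshot
                                  float *image,       // accumulated image
                                  size_t n)           // grid points per snapshot
{
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n)
        image[i] += fwd[i] * bwd[i];   // summed over all stored time steps
}

// Host side: called once per stored time step, after the backward
// extrapolation has produced the receiver wavefield for that step, e.g.
//   imaging_condition<<<(n + 255) / 256, 256>>>(d_fwd, d_bwd, d_image, n);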

14 ©Bull, 2012 GPU Tech Conference 2012 – San Jose

Turning waves

Prismatic waves

Diving waves

Strong reflections

Multiples

15 ©Bull, 2012 GPU Tech Conference 2012 – San Jose

3D Gridded Model
- Wave equation discretized into derivatives at set timesteps
- 3D grid size & resolution correspond to wavelength (max. frequency) & aperture size

Time Approximation by Finite Differences
- Differential equations transformed into finite difference equations at set timesteps
- Explicit scheme: each element is calculated recursively from several previously calculated points & timesteps (see the update formula after this list)

Fourier Methods
- Transforms between time & frequency domains
- Eliminates some cumulative errors found in FD approximations
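As an illustration of such an explicit scheme, a generic second-order-in-time update (not necessarily the exact stencil used in the codes discussed later) is:

\[
P^{\,n+1}_{i,j,k} \;=\; 2\,P^{\,n}_{i,j,k} \;-\; P^{\,n-1}_{i,j,k} \;+\; v_{i,j,k}^{2}\,\Delta t^{2}\;\nabla_h^{2} P^{\,n}_{i,j,k}
\]

where \(\nabla_h^{2}\) is the discrete spatial Laplacian of the chosen FD order and \(\Delta t\) the time step.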

16 ©Bull, 2012 GPU Tech Conference 2012 – San Jose

Grid size
- Frequency content
- Choice of FD scheme

Aperture
- Too big: too costly computationally
- Too small: depending on geology, may miss reflections

Storing downgoing wavefield
- Snapshots
- 'Virtual receivers' at model boundaries
- Random boundaries

Code parallelization
- 3D loops in OpenMP, CUDA

Domain decomposition
- MPI implemented to fit local memory

17 ©Bull, 2012 GPU Tech Conference 2012 – San Jose

Storing all wavefield snapshots
- Simple method, but generates enormous amounts of data
- Requires large-capacity, fast-access on-node storage
- Node I/O impacts performance

Checkpointing
- Store pairs of consecutive snapshots at specified time intervals

Storing boundary history only
- Record the wavefield at the edges & bottom of the model ('virtual' receivers)
- The calculation is recursive, so the downgoing wavefield can be regenerated (see the sketch after this list)

Random boundaries
- Make the boundaries random reflectors
- Extrapolate twice: once forward (no storage), once backward (regenerates the downgoing wavefield)
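A minimal sketch of the "store boundary history only" idea, in CUDA C: at each forward time step, only the outermost planes of the wavefield (width = stencil radius) are copied into a compact history buffer instead of a full snapshot. Array names, layout (x fastest) and the radius R are illustrative assumptions:

#include <cuda_runtime.h>

// Save the left/right x faces of the current wavefield into a history buffer.
__global__ void save_x_boundaries(const float *p,   // wavefield, nx*ny*nz, x fastest
                                  float *hist,      // 2*R*ny*nz values for this step
                                  int nx, int ny, int nz, int R)
{
    int j = blockIdx.x * blockDim.x + threadIdx.x;  // y index
    int k = blockIdx.y * blockDim.y + threadIdx.y;  // z index
    if (j >= ny || k >= nz) return;

    for (int r = 0; r < R; ++r) {
        size_t left  = ((size_t)k * ny + j) * nx + r;             // x = r
        size_t right = ((size_t)k * ny + j) * nx + (nx - 1 - r);  // x = nx-1-r
        size_t out   = ((size_t)k * ny + j) * 2 * R + r;
        hist[out]     = p[left];
        hist[out + R] = p[right];
    }
}

// Similar kernels would save the y faces and the bottom z face; during the
// backward pass the saved planes are written back, so the source wavefield
// can be re-propagated without storing full snapshots.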

18 ©Bull, 2012 GPU Tech Conference 2012 – San Jose

Multi-core CPU sockets

19 ©Bull, 2012 GPU Tech Conference 2012 – San Jose

CPUs connected to RAM via independent memory channels

20 ©Bull, 2012 GPU Tech Conference 2012 – San Jose

I/O Hub: 2 to 4 GPUs per node

21 ©Bull, 2012 GPU Tech Conference 2012 – San Jose

I/O Hub: local mass storage (spinning or solid-state drives)

22 ©Bull, 2012 GPU Tech Conference 2012 – San Jose

I/O Hub: node-to-node interconnect

23 ©Bull, 2012 GPU Tech Conference 2012 – San Jose

[bullx range diagram]
- Water cooling
- bullx S supernodes
- bullx blades (B500 series and DLC B700 series)
- bullx R series
- Storage
- Accelerators
- Architecture
- supercomputer suite

24 ©Bull, 2012 GPU Tech Conference 2012 – San Jose

• 2 x Intel Xeon 5600
• 2 x NVIDIA M2090
• 2 x IB QDR

7U – 2.1 TFLOPS

Embedded accelerator for high performance with high energy efficiency

25 ©Bull, 2012 GPU Tech Conference 2012 – San Jose

Front view: 2 x CPUs, 2 x GPUs, double-width blade

- 2 NVIDIA Tesla M2090 GPUs
- 2 Intel® Xeon® 5600 quad/hexa-core CPUs
- 1 dedicated PCIe x16 connection for each GPU
- Double InfiniBand QDR connections between blades

Exploded view

26 ©Bull, 2012 GPU Tech Conference 2012 – San Jose

[Block diagrams: a generic multi-GPU system vs the bullx B505 accelerator blade, both built on Westmere-EP CPUs (31.2 GB/s memory bandwidth per socket, QPI at 12.8 GB/s each direction)]

- Multi-GPU system: the two CPUs share a single Tylersburg I/O controller; four GPUs, InfiniBand and GbE are attached through PCIe x8 links at 4 GB/s
- bullx B505 accelerator blade: one Tylersburg I/O controller per CPU; each of the two GPUs has a dedicated PCIe x16 link at 8 GB/s, with InfiniBand on PCIe x8 at 4 GB/s and GbE

27 ©Bull, 2012 GPU Tech Conference 2012 – San Jose

RTM Example: Salt Diapir

Object of study
- Demonstrate imaging quality of RTM
- Show GPU speedup

Paradigm ECHOS 1.1
- Uses AXE RTM libraries

Multi-client data imaged with PSDM

28 ©Bull, 2012 GPU Tech Conference 2012 – San Jose

- 2 cables, 8 streamers
- 16 x 408 traces
- 16 x 3.3 MB = 52.8 MB
- Streamer interval: 100 m
- Far offset: 5300 m
- Shot pattern: 5000 m x 700 m
- Sub-volume: 10 km x 6 km x 12 km
- Grid: 12.5 m x 12.5 m
- fmax = 25 Hz to fmax = 40 Hz

29 ©Bull, 2012 GPU Tech Conference 2012 – San Jose

CDP Grid Inline 2701

30 ©Bull, 2012 GPU Tech Conference 2012 – San Jose

Optimum Grid Inline 2701

31 ©Bull, 2012 GPU Tech Conference 2012 – San Jose

CDP Grid Inline 2891

32 ©Bull, 2012 GPU Tech Conference 2012 – San Jose

Optimum Grid Inline 2891

33 ©Bull, 2012 GPU Tech Conference 2012 – San Jose

Run times: 16 cores (1 node, 2 Sandy Bridge sockets)

                 25 Hz         30 Hz         35 Hz         40 Hz
Optimum grid     43 min        1 h 17 min    2 h 07 min    3 h 23 min
CDP grid         2 h 19 min    2 h 41 min    2 h 58 min    3 h 28 min

34 ©Bull, 2012 GPU Tech Conference 2012 – San Jose

Single-shot runtime (30 Hz)

- New B510 Sandy Bridge blades: 16 cores, 4 channels to memory – RTM image in 2 h 41 min
- B505 Westmere GPU blade: 2 x M2090 GPUs – RTM image in 15 min

35 ©Bull, 2012 GPU Tech Conference 2012 – San Jose

Choice of hybrid architecture depends on several factors
- Algorithm & numerical method employed
- Correlation strategy used (local storage requirements)
- Grid & aperture sizes
- Frequencies involved
- Size of survey

As RTM becomes more generally used, system scalability will be of critical importance
- Processor & co-processor technologies evolving rapidly
- Software environment maturing
- Economics of hybrid approach gaining hold

36 ©Bull, 2012 GPU Tech Conference 2012 – San Jose

2. Parallel Programming for Hybrid Architectures

Mathieu DUBOIS

Senior Application Engineer - Hardware Accelerators Expert

Applications & Performance Team

Grenoble Advanced Competency & Services Center

39 ©Bull, 2012 GPU Tech Conference 2012 – San Jose

Activities
- Training
- Benchmarking
- Proof of concept / code migration
- Code optimisation
- Technology watch & performance evaluation

Areas
- Physics, chemistry, biology
- Oil & gas
- Life science
- Security & finance

40 ©Bull, 2012 GPU Tech Conference 2012 – San Jose

Barcelona Supercomputing Center
• 252 M2090 GPUs
• 103 Tflops Linpack score
• Ranked 114 in the TOP500
• Ranked 7 in the GREEN500 (#1 in Europe)

GENCI
• 288 M2090 GPUs
• 110 Tflops Linpack score
• Ranked 102 in the TOP500
• Ranked 8 in the GREEN500

CEA - Tera 100
• 390 M2090 GPUs
• 154 Tflops Linpack score
• Ranked 75 in the TOP500

41 ©Bull, 2012 GPU Tech Conference 2012 – San Jose

42 ©Bull, 2012 GPU Tech Conference 2012 – San Jose

BULL’s expertise in GPU environments is well recognized

43 ©Bull, 2012 GPU Tech Conference 2012 – San Jose

2010 - First Prize: Dimitri Komatitsch
- SPECFEM3D (geodynamics)
- GPU version in development

2009 - First Prize: Luigi Genovese
- BigDFT (nanosciences)
- CUDA & OpenCL version available

Award and active development of major scientific applications

44 ©Bull, 2012 GPU Tech Conference 2012 – San Jose

[Chart: GPU programming environments ranked along two axes]
- Performance axis: PGI Accelerator, HMPP, OpenCL, Fortran CUDA, CUDA C
- Simplicity axis: PGI Accelerator, HMPP, Fortran CUDA, CUDA C, OpenCL

45 ©Bull, 2012 GPU Tech Conference 2012 – San Jose

Isotropic wave equation:

\[
\frac{1}{v^{2}} \frac{\partial^{2} P}{\partial t^{2}}
= \frac{\partial^{2} P}{\partial x^{2}}
+ \frac{\partial^{2} P}{\partial y^{2}}
+ \frac{\partial^{2} P}{\partial z^{2}}
\]

- order-k in space stencil (here k is 4)
- memory-bandwidth-bound code

46 ©Bull, 2012 GPU Tech Conference 2012 – San Jose

[CUDA execution and memory hierarchy diagram]
- Per-thread local memory
- Per-block shared memory
- Per-GPU global memory
- Threads are grouped into blocks of threads; kernels (kernel 1, kernel 2, ...) execute sequentially

47 ©Bull, 2012 GPU Tech Conference 2012 – San Jose

http://developer.download.nvidia.com/CUDA/CUDA_Zone/papers/gpu_3dfd_rev.pdf

First approach: 3k + 1 elements are needed for 1 output value.

Better approach: some data are reused for several output values, so the calculation is performed from shared memory; shared-memory latency is two orders of magnitude lower than global-memory latency (a sketch of this tiling follows).

Result: one order of magnitude higher performance comparing one GPU to one CPU core.

Computation can also be overlapped with data transfers when saving the output wavefield.
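A minimal sketch of this shared-memory tiling in CUDA C, for the order-4 (radius 2) case: each thread block stages one x-y tile of the wavefield plus its halo in shared memory and computes the Laplacian from it, while the z neighbours are still read from global memory (the referenced paper goes further and marches along z with registers). Array names, tile sizes and the coefficients c0, c1, c2 are illustrative assumptions, not the production code:

#include <cuda_runtime.h>

#define R   2            // stencil radius: order-4 in space
#define BX 16            // tile width  (x)
#define BY 16            // tile height (y)

__global__ void laplacian_order4(const float *p, float *lap,
                                 int nx, int ny, int nz,
                                 float c0, float c1, float c2)
{
    __shared__ float tile[BY + 2 * R][BX + 2 * R];

    int ix = blockIdx.x * BX + threadIdx.x;          // global x
    int iy = blockIdx.y * BY + threadIdx.y;          // global y
    int iz = blockIdx.z;                             // one z plane per block slice
    bool inside = (ix < nx && iy < ny);

    size_t idx = ((size_t)iz * ny + iy) * nx + ix;
    int tx = threadIdx.x + R, ty = threadIdx.y + R;  // position inside the tile

    // Stage the tile centre and its x/y halo in shared memory.
    if (inside) {
        tile[ty][tx] = p[idx];
        if (threadIdx.x < R) {
            tile[ty][tx - R]  = (ix >= R)      ? p[idx - R]  : 0.0f;
            tile[ty][tx + BX] = (ix + BX < nx) ? p[idx + BX] : 0.0f;
        }
        if (threadIdx.y < R) {
            tile[ty - R][tx]  = (iy >= R)      ? p[idx - (size_t)R * nx]  : 0.0f;
            tile[ty + BY][tx] = (iy + BY < ny) ? p[idx + (size_t)BY * nx] : 0.0f;
        }
    }
    __syncthreads();

    // Interior points only: the outer halo of the model is skipped.
    if (!inside || ix < R || ix >= nx - R || iy < R || iy >= ny - R ||
        iz < R || iz >= nz - R)
        return;

    size_t nxy = (size_t)nx * ny;                    // stride between z planes
    lap[idx] = c0 *  tile[ty][tx]
             + c1 * (tile[ty][tx - 1] + tile[ty][tx + 1] +
                     tile[ty - 1][tx] + tile[ty + 1][tx] +
                     p[idx - nxy]     + p[idx + nxy])
             + c2 * (tile[ty][tx - 2] + tile[ty][tx + 2] +
                     tile[ty - 2][tx] + tile[ty + 2][tx] +
                     p[idx - 2 * nxy] + p[idx + 2 * nxy]);
}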

48 ©Bull, 2012 GPU Tech Conference 2012 – San Jose

The code is based on an extension of the pseudo-spectral method called the pseudo-analytic model:
- It modifies the Fourier transform of the Laplacian operator (the standard spectral form is recalled below), correcting the propagation errors of the finite-difference scheme
- The result is nearly non-dispersive wave propagation

Original source code: Fortran 90, using OpenMP and MKL FFTs, one shot per node.

Obvious hot spots: the FFTs and the Laplacian.
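For reference, the standard pseudo-spectral Laplacian that this approach starts from is (the modified symbol used by the pseudo-analytic correction is not given in the slides):

\[
\nabla^{2} P(x,y,z) \;=\; \mathcal{F}^{-1}\!\Big[\, -\big(k_x^{2} + k_y^{2} + k_z^{2}\big)\, \mathcal{F}[P](k_x,k_y,k_z) \,\Big]
\]

The pseudo-analytic model replaces the factor \(-\big(k_x^{2}+k_y^{2}+k_z^{2}\big)\) by a modified operator chosen so that, combined with the time stepping, the propagation is nearly non-dispersive.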

49 ©Bull, 2012 GPU Tech Conference 2012 – San Jose

[Profiling chart: time (sec) per kernel – kernel 1 to kernel 5 and FFTs]

85% of the time is spent in one subroutine. Within this subroutine, 6 kernels are identified:
- kernel 1: 31%
- kernel 2: 13%
- kernel 3: 2%
- kernel 4: 1%
- kernel 5: 18%
- FFTs: 35%

50 ©Bull, 2012 GPU Tech Conference 2012 – San Jose

Benchmark platform: bullx B505 server with
- 2 Intel Westmere 4-core processors @ 2.67 GHz
- 24 GB DDR3 @ 1333 MHz
- 2 NVIDIA M2090 GPUs

Software and tools:
- NVIDIA CUDA 4.1
- Intel compilers version 12 and Intel MPI 4
- PGI compilers 11

Porting approach:
- Use the CUFFT library and write call wrappers (see the sketch after this list)
- Write a CUDA kernel for each of the 5 subroutine kernels (to avoid transfers)
- Compare CUDA C, Fortran CUDA and HMPP
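A minimal sketch of the kind of cuFFT call wrapper mentioned above: thin C functions around the cuFFT API that Fortran can call directly. The wrapper names, the trailing-underscore convention and the in-place single-precision transform are illustrative assumptions, not the project's actual interfaces:

#include <cufft.h>

extern "C" {

// Create a 3D single-precision complex-to-complex plan.
// cuFFT expects the slowest-varying dimension first, so the Fortran
// array extents (x fastest) are passed in reverse order.
void cufft_plan3d_c2c_(cufftHandle *plan, int *nx, int *ny, int *nz)
{
    cufftPlan3d(plan, *nz, *ny, *nx, CUFFT_C2C);
}

// Execute the plan in place on data already resident on the GPU.
// direction: CUFFT_FORWARD (-1) or CUFFT_INVERSE (+1).
void cufft_exec_c2c_(cufftHandle *plan, cufftComplex *data, int *direction)
{
    cufftExecC2C(*plan, data, data, *direction);
}

void cufft_destroy_(cufftHandle *plan)
{
    cufftDestroy(*plan);
}

} // extern "C"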

51 ©Bull, 2012 GPU Tech Conference 2012 – San Jose

Simplified Fortran porting (Fortran CUDA):
- No need for Fortran-to-C CUDA interfaces
- No problems converting unit-stride memory accesses in multidimensional arrays
- API simplified and identical to Fortran 90

! Define variables on the CPU (pinned host memory)
real, pinned, allocatable, dimension(:,:,:) :: A_host

! Define variables on the GPU (device memory)
real, device, allocatable, dimension(:,:,:) :: A_device

! Allocate both in a single call
allocate( A_host(nx,ny,nz), A_device(nx,ny,nz) )

! Transfer data between CPU and GPU with a simple assignment
A_device = A_host

Same performance between CUDA C and Fortran CUDA (a CUDA C equivalent of the snippet above is sketched below).
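For comparison, a minimal sketch of the same allocation and transfer in CUDA C; the array name A and its size are illustrative assumptions:

#include <cuda_runtime.h>

void alloc_and_copy(int nx, int ny, int nz)
{
    size_t bytes = (size_t)nx * ny * nz * sizeof(float);
    float *A_host, *A_device;

    cudaMallocHost((void **)&A_host, bytes);   // pinned host memory
    cudaMalloc((void **)&A_device, bytes);     // device memory

    /* ... fill A_host ... */

    cudaMemcpy(A_device, A_host, bytes, cudaMemcpyHostToDevice);
}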

53 ©Bull, 2012 GPU Tech Conference 2012 – San Jose

4x speed-up between 1 M2090 GPU and 8 Xeon cores

Data transfers are reduced to 1 second

[Bar chart: time (sec) per kernel – kernel 1 to kernel 5 and FFTs – for 8 Xeon cores vs 1 M2090 GPU]

54 ©Bull, 2012 GPU Tech Conference 2012 – San Jose

RTM is an embarrassingly parallel application (over shots).

On standard CPU servers: compute 1 shot per node and take advantage of all the CPU cores for full MKL FFT performance; overall performance will increase with new-generation processors.

On GPU servers: 4x speed-up for one shot using one GPU; compute 1 shot per GPU available on the server; either halve the number of servers for the same throughput, or keep the same number of servers and double the throughput (a sketch of the shot-to-GPU mapping follows).

Caveat: the benchmark used a small data set; a production problem size may be too big to fit in today's GPU memory.
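A minimal sketch of "one shot per GPU" in CUDA C with MPI: each rank binds to one of the GPUs in its node and processes its own subset of shots. The function name process_shot, the round-robin shot distribution and the rank placement within a node are illustrative assumptions:

#include <mpi.h>
#include <cuda_runtime.h>

extern void process_shot(int shot_id);   // RTM of a single shot (assumed)

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    // Bind this rank to one GPU of the node (ranks assumed to be
    // placed consecutively within a node).
    int ngpus = 0;
    cudaGetDeviceCount(&ngpus);
    cudaSetDevice(rank % ngpus);

    const int nshots = 1000;              // assumed survey size
    for (int s = rank; s < nshots; s += size)
        process_shot(s);                  // shots are independent

    MPI_Finalize();
    return 0;
}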

55 ©Bull, 2012 GPU Tech Conference 2012 – San Jose

BULL has built its expertise on real customer requests:
- Trainings
- Proofs of concept in Oil & Gas, Finance, Life Science, Material Science
- Advice for cluster architecture definition
- Pro-activity

BULL's expertise is recognized:
- Successful POCs with significant speed-ups and cost reductions
- Acknowledgments in scientific publications
- Help with code migration and optimization

56 ©Bull, 2012 GPU Tech Conference 2012 – San Jose
