Hybrid Architectures for Seismic Imaging – Bull, GPU Technology Conference 2012, San Jose


Guy Gueritz Oil & Gas Business Development

Mathieu Dubois Senior HPC Consultant

2 ©Bull, 2012 GPU Tech Conference 2012 – San Jose

1. Hybrid Architectures for Seismic Imaging

- BULL profile in HPC
- Hybrid architectures
- Example: Reverse Time Migration

2. Parallel Programming for Hybrid Architectures

- GPU activities at BULL: building an expertise
- Tools and programming environments
- Numerical methods
- Scalability

3 ©Bull, 2012 GPU Tech Conference 2012 – San Jose

1. Hybrid Architectures for Seismic Imaging

Guy GUERITZ

Oil & Gas Business Development

Grenoble Advanced Competency & Services Center

4 ©Bull, 2012 GPU Tech Conference 2012 – San Jose

[Bull group overview chart: business lines BCS, BIS, BSS and BIP; revenue figures of €1.35–1.45 billion and €1.2 billion, with direct margin, indirect costs and an EBIT of €50–60 million]

Shareholders:
- Crescendo Ind. 20%
- France Télécom 8%
- FSI 5%
- NEC 2%
- Floating 65%
- Total 100%

2011 figures:
- Revenue: €1,301 M (+4.6%)
- Gross margin: +4.2%
- EBIT: +23%
- Employees: 9,000

[Chart: revenue and profitability trajectory 2010 > 2013 across maintenance & PRS, services, hardware & systems, fulfillment and critical systems]

5 ©Bull, 2012 GPU Tech Conference 2012 – San Jose

[Chart: Bull HPC income in €M, excluding maintenance – 2007: 37, 2008: 70, 2009: 98, 2010: 152, 2011: 181]

More than €180M of income in 2011 (w/o maintenance)

Three petaflop-scale systems

- 2010: Tera 100, the first petaflop-scale system designed and developed in Europe, one of the most efficient in its category (84% Linpack efficiency)

- 2010-2011: GENCI / Curie (France) - 2 Pflops

- 2011-2012: IFERC - 1.5 Pflops

Other recent key projects
- KNMI (Netherlands): meteorology

- Barcelona Supercomputing Center (Spain): 186 Tflops (hybrid)

- Société Générale (France) : 350 Tflops

- Dassault Aviation (France) : 100 Tflops

- AWE (UK) : 250 Tflops

Launch of Extreme Factory (HPC pay per use) and Mobull (HPC mobile data center)

- Extreme Factory: Renault, Exa, LL Products, Classified

- Mobull: U_Perpignan, Cenaero

6 ©Bull, 2012 GPU Tech Conference 2012 – San Jose

Services: design, architecture, project management, optimisation

supercomputer suite, StoreWay storage

Hardware platforms, software environments, interconnect, storage systems

Built from standard components, optimized by Bull’s innovation

7 ©Bull, 2012 GPU Tech Conference 2012 – San Jose

- Structural mechanics (implicit)
- Structural mechanics (explicit)
- Computational fluid dynamics
- Electromagnetics
- Computational chemistry
- Quantum mechanics
- Reservoir simulation
- Rendering / ray tracing
- Climate / weather
- Ocean simulation
- Data analytics
- Molecular dynamics
- Computational biology
- Seismic processing

8 ©Bull, 2012 GPU Tech Conference 2012 – San Jose

TERA 100 – GPU-based extension
- 198 bullx B505 accelerator blades
- 396 NVIDIA® Tesla™ M2090 GPUs
- 202,752 GPU cores

CURIE – GPU-based extension
- 144 bullx B505 accelerator blades
- 288 NVIDIA® Tesla™ M2090 GPUs
- 147,456 GPU cores

Barcelona Supercomputing Centre – GPU-based system
- 126 bullx B505 accelerator blades
- 252 NVIDIA® Tesla™ M2090 GPUs
- 129,024 GPU cores

9 ©Bull, 2012 GPU Tech Conference 2012 – San Jose

Need: a supercomputing system
- to be installed at Petrobras' new data center, at the University Campus of Rio de Janeiro
- equipped with GPU accelerator technology
- dedicated to the development of new subsurface imaging techniques to support oil exploration and production

Solution: a hybrid architecture coupling 66 general-purpose servers to 66 GPU systems
- 66 bullx R422 E2 servers, i.e. 132 compute nodes or 1,056 Intel® Xeon® 5500 cores, providing a peak performance of 12.4 Tflops
- 66 NVIDIA® Tesla S1070 GPU systems, i.e. 63,360 cores, providing an additional theoretical performance of 246 Tflops
- 1 bullx R423 E2 service node
- Ultra-fast InfiniBand QDR interconnect
- bullx cluster suite and Red Hat Enterprise Linux

Petrobras: leader in the Brazilian petrochemical sector, and one of the largest integrated energy companies in the world

10 ©Bull, 2012 GPU Tech Conference 2012 – San Jose

11 ©Bull, 2012 GPU Tech Conference 2012 – San Jose

Source: exascale.org

12 ©Bull, 2012 GPU Tech Conference 2012 – San Jose

(Animation courtesy of the Institute of Geophysics in Hamburg)

13 ©Bull, 2012 GPU Tech Conference 2012 – San Jose

Forward Pass
• First recursion – forward in time
• Model the downgoing wavefield, store snapshots of the wavefield at set time intervals

Backward Pass
• Second recursion – reverse time
• Compute the backward extrapolation of wavefield snapshots, starting with the receiver data

Correlate Forward + Backward Snapshots
• Apply the imaging condition
• Correlate forward and backward samples together (a sketch follows)
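A minimal sketch of the zero-lag cross-correlation imaging condition described above, in CUDA C. The array names (fwd, bwd, image), launch configuration and grid size are illustrative assumptions, not the vendor code:

#include <cuda_runtime.h>

// Accumulate the zero-lag correlation of one pair of snapshots into the image.
__global__ void imaging_condition(const float *fwd,   // forward (source) snapshot
                                  const float *bwd,   // backward (receiver) snapshot
                                  float *image,       // accumulated image
                                  size_t n)           // grid points per snapshot
{
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n)
        image[i] += fwd[i] * bwd[i];   // summed over all stored time steps
}

// Host side: called once per stored time step, after the backward
// extrapolation has produced the receiver wavefield for that step, e.g.
//   imaging_condition<<<(n + 255) / 256, 256>>>(d_fwd, d_bwd, d_image, n);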

14 ©Bull, 2012 GPU Tech Conference 2012 – San Jose

Turning waves

Prismatic waves

Diving waves

Strong reflections

Multiples

15 ©Bull, 2012 GPU Tech Conference 2012 – San Jose

3D Gridded Model
- Wave equation discretized into derivatives at set timesteps
- 3D grid size & resolution correspond to wavelength (max. frequency) & aperture size

Time Approximation by Finite Differences
- Differential equations transformed into finite difference equations at set timesteps
- Explicit scheme: each element is calculated recursively from several previously calculated points & timesteps (see the update formula after this list)

Fourier Methods
- Transforms between time & frequency domains
- Eliminates some cumulative errors found in FD approximations
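As an illustration of such an explicit scheme, a generic second-order-in-time update (not necessarily the exact stencil used in the codes discussed later) is:

\[
P^{\,n+1}_{i,j,k} \;=\; 2\,P^{\,n}_{i,j,k} \;-\; P^{\,n-1}_{i,j,k} \;+\; v_{i,j,k}^{2}\,\Delta t^{2}\;\nabla_h^{2} P^{\,n}_{i,j,k}
\]

where \(\nabla_h^{2}\) is the discrete spatial Laplacian of the chosen FD order and \(\Delta t\) the time step.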

16 ©Bull, 2012 GPU Tech Conference 2012 – San Jose

Grid size
- Frequency content
- Choice of FD scheme

Aperture
- Too big: too costly computationally
- Too small: depending on geology, may miss reflections

Storing downgoing wavefield
- Snapshots
- 'Virtual receivers' at model boundaries
- Random boundaries

Code parallelization
- 3D loops in OpenMP, CUDA

Domain decomposition
- MPI implemented to fit local memory

17 ©Bull, 2012 GPU Tech Conference 2012 – San Jose

Storing all wavefield snapshots
- Simple method, but generates enormous amounts of data
- Requires large-capacity, fast-access on-node storage
- Node I/O impacts performance

Checkpointing
- Store pairs of consecutive snapshots at specified time intervals

Storing boundary history only
- Record the wavefield at the edges & bottom of the model ('virtual' receivers)
- The calculation is recursive, so the downgoing wavefield can be regenerated (see the sketch after this list)

Random boundaries
- Make the boundaries random reflectors
- Extrapolate twice: once forward (no storage), once backward (regenerates the downgoing wavefield)
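A minimal sketch of the "store boundary history only" idea, in CUDA C: at each forward time step, only the outermost planes of the wavefield (width = stencil radius) are copied into a compact history buffer instead of a full snapshot. Array names, layout (x fastest) and the radius R are illustrative assumptions:

#include <cuda_runtime.h>

// Save the left/right x faces of the current wavefield into a history buffer.
__global__ void save_x_boundaries(const float *p,   // wavefield, nx*ny*nz, x fastest
                                  float *hist,      // 2*R*ny*nz values for this step
                                  int nx, int ny, int nz, int R)
{
    int j = blockIdx.x * blockDim.x + threadIdx.x;  // y index
    int k = blockIdx.y * blockDim.y + threadIdx.y;  // z index
    if (j >= ny || k >= nz) return;

    for (int r = 0; r < R; ++r) {
        size_t left  = ((size_t)k * ny + j) * nx + r;             // x = r
        size_t right = ((size_t)k * ny + j) * nx + (nx - 1 - r);  // x = nx-1-r
        size_t out   = ((size_t)k * ny + j) * 2 * R + r;
        hist[out]     = p[left];
        hist[out + R] = p[right];
    }
}

// Similar kernels would save the y faces and the bottom z face; during the
// backward pass the saved planes are written back, so the source wavefield
// can be re-propagated without storing full snapshots.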

18 ©Bull, 2012 GPU Tech Conference 2012 – San Jose

Multi-core CPU sockets

19 ©Bull, 2012 GPU Tech Conference 2012 – San Jose

CPUs connected to RAM via independent memory channels

20 ©Bull, 2012 GPU Tech Conference 2012 – San Jose

I/O Hub: 2 to 4 GPUs per node

21 ©Bull, 2012 GPU Tech Conference 2012 – San Jose

I/O Hub: local mass storage (spinning or solid-state drives)

22 ©Bull, 2012 GPU Tech Conference 2012 – San Jose

I/O Hub: node-to-node interconnect

23 ©Bull, 2012 GPU Tech Conference 2012 – San Jose

[bullx range diagram]
- Water cooling
- bullx S supernodes
- bullx blades (B500 series and DLC B700 series)
- bullx R series
- Storage
- Accelerators
- Architecture
- supercomputer suite

24 ©Bull, 2012 GPU Tech Conference 2012 – San Jose

• 2 x Intel Xeon 5600
• 2 x NVIDIA M2090
• 2 x IB QDR

7U – 2.1 TFLOPS

Embedded accelerator for high performance with high energy efficiency

25 ©Bull, 2012 GPU Tech Conference 2012 – San Jose

Front view: 2 x CPUs, 2 x GPUs, double-width blade

- 2 NVIDIA Tesla M2090 GPUs
- 2 Intel® Xeon® 5600 quad/hexa-core CPUs
- 1 dedicated PCIe x16 connection for each GPU
- Double InfiniBand QDR connections between blades

Exploded view

26 ©Bull, 2012 GPU Tech Conference 2012 – San Jose

[Block diagrams: a generic multi-GPU system vs the bullx B505 accelerator blade, both built on Westmere-EP CPUs (31.2 GB/s memory bandwidth per socket, QPI at 12.8 GB/s each direction)]

- Multi-GPU system: the two CPUs share a single Tylersburg I/O controller; four GPUs, InfiniBand and GbE are attached through PCIe x8 links at 4 GB/s
- bullx B505 accelerator blade: one Tylersburg I/O controller per CPU; each of the two GPUs has a dedicated PCIe x16 link at 8 GB/s, with InfiniBand on PCIe x8 at 4 GB/s and GbE

27 ©Bull, 2012 GPU Tech Conference 2012 – San Jose

RTM Example: Salt Diapir

Object of study
- Demonstrate imaging quality of RTM
- Show GPU speedup

Paradigm ECHOS 1.1
- Uses AXE RTM libraries

Multi-client data imaged with PSDM

28 ©Bull, 2012 GPU Tech Conference 2012 – San Jose

- 2 cables, 8 streamers
- 16 x 408 traces
- 16 x 3.3 MB = 52.8 MB
- Streamer interval: 100 m
- Far offset: 5300 m
- Shot pattern: 5000 m x 700 m
- Sub-volume: 10 km x 6 km x 12 km
- Grid: 12.5 m x 12.5 m
- fmax = 25 Hz to fmax = 40 Hz

29 ©Bull, 2012 GPU Tech Conference 2012 – San Jose

CDP Grid Inline 2701

30 ©Bull, 2012 GPU Tech Conference 2012 – San Jose

Optimum Grid Inline 2701

31 ©Bull, 2012 GPU Tech Conference 2012 – San Jose

CDP Grid Inline 2891

32 ©Bull, 2012 GPU Tech Conference 2012 – San Jose

Optimum Grid Inline 2891

33 ©Bull, 2012 GPU Tech Conference 2012 – San Jose

Run times: 16 cores (1 node, 2 Sandy Bridge sockets)

                 25 Hz         30 Hz         35 Hz         40 Hz
Optimum grid     43 min        1 h 17 min    2 h 07 min    3 h 23 min
CDP grid         2 h 19 min    2 h 41 min    2 h 58 min    3 h 28 min

34 ©Bull, 2012 GPU Tech Conference 2012 – San Jose

Single-shot runtime (30 Hz)

- New B510 Sandy Bridge blades: 16 cores, 4 channels to memory – RTM image in 2 h 41 min
- B505 Westmere GPU blade: 2 x M2090 GPUs – RTM image in 15 min

35 ©Bull, 2012 GPU Tech Conference 2012 – San Jose

Choice of hybrid architecture depends on several factors
- Algorithm & numerical method employed
- Correlation strategy used (local storage requirements)
- Grid & aperture sizes
- Frequencies involved
- Size of survey

As RTM becomes more generally used, system scalability will be of critical importance
- Processor & co-processor technologies evolving rapidly
- Software environment maturing
- Economics of hybrid approach gaining hold

36 ©Bull, 2012 GPU Tech Conference 2012 – San Jose

2. Parallel Programming for Hybrid Architectures

Mathieu DUBOIS

Senior Application Engineer - Hardware Accelerators Expert

Applications & Performance Team

Grenoble Advanced Competency & Services Center

39 ©Bull, 2012 GPU Tech Conference 2012 – San Jose

Activities
- Training
- Benchmarking
- Proof of concept / code migration
- Code optimisation
- Technology watch & performance evaluation

Areas
- Physics, chemistry, biology
- Oil & gas
- Life science
- Security & finance

40 ©Bull, 2012 GPU Tech Conference 2012 – San Jose

Barcelona Supercomputing Center
• 252 M2090 GPUs
• 103 Tflops Linpack score
• Ranked 114 in the TOP500
• Ranked 7 in the GREEN500 (#1 in Europe)

GENCI
• 288 M2090 GPUs
• 110 Tflops Linpack score
• Ranked 102 in the TOP500
• Ranked 8 in the GREEN500

CEA - Tera 100
• 390 M2090 GPUs
• 154 Tflops Linpack score
• Ranked 75 in the TOP500

41 ©Bull, 2012 GPU Tech Conference 2012 – San Jose

42 ©Bull, 2012 GPU Tech Conference 2012 – San Jose

BULL’s expertise in GPU environments is well recognized

43 ©Bull, 2012 GPU Tech Conference 2012 – San Jose

2010 - First Prize: Dimitri Komatitsch
- SPECFEM3D (geodynamics)
- GPU version in development

2009 - First Prize: Luigi Genovese
- BigDFT (nanosciences)
- CUDA & OpenCL version available

Award and active development of major scientific applications

44 ©Bull, 2012 GPU Tech Conference 2012 – San Jose

[Chart: GPU programming environments ranked along two axes]
- Performance axis: PGI Accelerator, HMPP, OpenCL, Fortran CUDA, CUDA C
- Simplicity axis: PGI Accelerator, HMPP, Fortran CUDA, CUDA C, OpenCL

45 ©Bull, 2012 GPU Tech Conference 2012 – San Jose

Isotropic wave equation:

\[
\frac{1}{v^{2}} \frac{\partial^{2} P}{\partial t^{2}}
= \frac{\partial^{2} P}{\partial x^{2}}
+ \frac{\partial^{2} P}{\partial y^{2}}
+ \frac{\partial^{2} P}{\partial z^{2}}
\]

- order-k in space stencil (here k is 4)
- memory-bandwidth-bound code

46 ©Bull, 2012 GPU Tech Conference 2012 – San Jose

[CUDA execution and memory hierarchy diagram]
- Per-thread local memory
- Per-block shared memory
- Per-GPU global memory
- Threads are grouped into blocks of threads; kernels (kernel 1, kernel 2, ...) execute sequentially

47 ©Bull, 2012 GPU Tech Conference 2012 – San Jose

http://developer.download.nvidia.com/CUDA/CUDA_Zone/papers/gpu_3dfd_rev.pdf

First approach: 3k + 1 elements are needed for 1 output value.

Better approach: some data are reused for several output values, so the calculation is performed from shared memory; shared-memory latency is two orders of magnitude lower than global-memory latency (a sketch of this tiling follows).

Result: one order of magnitude higher performance comparing one GPU to one CPU core.

Computation can also be overlapped with data transfers when saving the output wavefield.
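A minimal sketch of this shared-memory tiling in CUDA C, for the order-4 (radius 2) case: each thread block stages one x-y tile of the wavefield plus its halo in shared memory and computes the Laplacian from it, while the z neighbours are still read from global memory (the referenced paper goes further and marches along z with registers). Array names, tile sizes and the coefficients c0, c1, c2 are illustrative assumptions, not the production code:

#include <cuda_runtime.h>

#define R   2            // stencil radius: order-4 in space
#define BX 16            // tile width  (x)
#define BY 16            // tile height (y)

__global__ void laplacian_order4(const float *p, float *lap,
                                 int nx, int ny, int nz,
                                 float c0, float c1, float c2)
{
    __shared__ float tile[BY + 2 * R][BX + 2 * R];

    int ix = blockIdx.x * BX + threadIdx.x;          // global x
    int iy = blockIdx.y * BY + threadIdx.y;          // global y
    int iz = blockIdx.z;                             // one z plane per block slice
    bool inside = (ix < nx && iy < ny);

    size_t idx = ((size_t)iz * ny + iy) * nx + ix;
    int tx = threadIdx.x + R, ty = threadIdx.y + R;  // position inside the tile

    // Stage the tile centre and its x/y halo in shared memory.
    if (inside) {
        tile[ty][tx] = p[idx];
        if (threadIdx.x < R) {
            tile[ty][tx - R]  = (ix >= R)      ? p[idx - R]  : 0.0f;
            tile[ty][tx + BX] = (ix + BX < nx) ? p[idx + BX] : 0.0f;
        }
        if (threadIdx.y < R) {
            tile[ty - R][tx]  = (iy >= R)      ? p[idx - (size_t)R * nx]  : 0.0f;
            tile[ty + BY][tx] = (iy + BY < ny) ? p[idx + (size_t)BY * nx] : 0.0f;
        }
    }
    __syncthreads();

    // Interior points only: the outer halo of the model is skipped.
    if (!inside || ix < R || ix >= nx - R || iy < R || iy >= ny - R ||
        iz < R || iz >= nz - R)
        return;

    size_t nxy = (size_t)nx * ny;                    // stride between z planes
    lap[idx] = c0 *  tile[ty][tx]
             + c1 * (tile[ty][tx - 1] + tile[ty][tx + 1] +
                     tile[ty - 1][tx] + tile[ty + 1][tx] +
                     p[idx - nxy]     + p[idx + nxy])
             + c2 * (tile[ty][tx - 2] + tile[ty][tx + 2] +
                     tile[ty - 2][tx] + tile[ty + 2][tx] +
                     p[idx - 2 * nxy] + p[idx + 2 * nxy]);
}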

48 ©Bull, 2012 GPU Tech Conference 2012 – San Jose

The code is based on an extension of the pseudo-spectral method called the pseudo-analytic model:
- It modifies the Fourier transform of the Laplacian operator (the standard spectral form is recalled below), correcting the propagation errors of the finite-difference scheme
- The result is nearly non-dispersive wave propagation

Original source code: Fortran 90, using OpenMP and MKL FFTs, one shot per node.

Obvious hot spots: the FFTs and the Laplacian.
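For reference, the standard pseudo-spectral Laplacian that this approach starts from is (the modified symbol used by the pseudo-analytic correction is not given in the slides):

\[
\nabla^{2} P(x,y,z) \;=\; \mathcal{F}^{-1}\!\Big[\, -\big(k_x^{2} + k_y^{2} + k_z^{2}\big)\, \mathcal{F}[P](k_x,k_y,k_z) \,\Big]
\]

The pseudo-analytic model replaces the factor \(-\big(k_x^{2}+k_y^{2}+k_z^{2}\big)\) by a modified operator chosen so that, combined with the time stepping, the propagation is nearly non-dispersive.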

49 ©Bull, 2012 GPU Tech Conference 2012 – San Jose

[Profiling chart: time (sec) per kernel – kernel 1 to kernel 5 and FFTs]

85% of the time is spent in one subroutine. Within this subroutine, 6 kernels are identified:
- kernel 1: 31%
- kernel 2: 13%
- kernel 3: 2%
- kernel 4: 1%
- kernel 5: 18%
- FFTs: 35%

50 ©Bull, 2012 GPU Tech Conference 2012 – San Jose

Benchmark platform: bullx B505 server with
- 2 Intel Westmere 4-core processors @ 2.67 GHz
- 24 GB DDR3 @ 1333 MHz
- 2 NVIDIA M2090 GPUs

Software and tools:
- NVIDIA CUDA 4.1
- Intel compilers version 12 and Intel MPI 4
- PGI compilers 11

Porting approach:
- Use the CUFFT library and write call wrappers (see the sketch after this list)
- Write a CUDA kernel for each of the 5 subroutine kernels (to avoid transfers)
- Compare CUDA C, Fortran CUDA and HMPP
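A minimal sketch of the kind of cuFFT call wrapper mentioned above: thin C functions around the cuFFT API that Fortran can call directly. The wrapper names, the trailing-underscore convention and the in-place single-precision transform are illustrative assumptions, not the project's actual interfaces:

#include <cufft.h>

extern "C" {

// Create a 3D single-precision complex-to-complex plan.
// cuFFT expects the slowest-varying dimension first, so the Fortran
// array extents (x fastest) are passed in reverse order.
void cufft_plan3d_c2c_(cufftHandle *plan, int *nx, int *ny, int *nz)
{
    cufftPlan3d(plan, *nz, *ny, *nx, CUFFT_C2C);
}

// Execute the plan in place on data already resident on the GPU.
// direction: CUFFT_FORWARD (-1) or CUFFT_INVERSE (+1).
void cufft_exec_c2c_(cufftHandle *plan, cufftComplex *data, int *direction)
{
    cufftExecC2C(*plan, data, data, *direction);
}

void cufft_destroy_(cufftHandle *plan)
{
    cufftDestroy(*plan);
}

} // extern "C"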

51 ©Bull, 2012 GPU Tech Conference 2012 – San Jose

Simplified Fortran porting (Fortran CUDA):
- No need for Fortran-to-C CUDA interfaces
- No problems converting unit-stride memory accesses in multidimensional arrays
- API simplified and identical to Fortran 90

! Define variables on the CPU (pinned host memory)
real, pinned, allocatable, dimension(:,:,:) :: A_host

! Define variables on the GPU (device memory)
real, device, allocatable, dimension(:,:,:) :: A_device

! Allocate both in a single call
allocate( A_host(nx,ny,nz), A_device(nx,ny,nz) )

! Transfer data between CPU and GPU with a simple assignment
A_device = A_host

Same performance between CUDA C and Fortran CUDA (a CUDA C equivalent of the snippet above is sketched below).
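For comparison, a minimal sketch of the same allocation and transfer in CUDA C; the array name A and its size are illustrative assumptions:

#include <cuda_runtime.h>

void alloc_and_copy(int nx, int ny, int nz)
{
    size_t bytes = (size_t)nx * ny * nz * sizeof(float);
    float *A_host, *A_device;

    cudaMallocHost((void **)&A_host, bytes);   // pinned host memory
    cudaMalloc((void **)&A_device, bytes);     // device memory

    /* ... fill A_host ... */

    cudaMemcpy(A_device, A_host, bytes, cudaMemcpyHostToDevice);
}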

53 ©Bull, 2012 GPU Tech Conference 2012 – San Jose

4x speed-up between 1 M2090 GPU and 8 Xeon cores

Data transfers are reduced to 1 second

[Bar chart: time (sec) per kernel – kernel 1 to kernel 5 and FFTs – for 8 Xeon cores vs 1 M2090 GPU]

54 ©Bull, 2012 GPU Tech Conference 2012 – San Jose

RTM is an embarrassingly parallel application (over shots).

On standard CPU servers: compute 1 shot per node and take advantage of all the CPU cores for full MKL FFT performance; overall performance will increase with new-generation processors.

On GPU servers: 4x speed-up for one shot using one GPU; compute 1 shot per GPU available on the server; either halve the number of servers for the same throughput, or keep the same number of servers and double the throughput (a sketch of the shot-to-GPU mapping follows).

Caveat: the benchmark used a small data set; a production problem size may be too big to fit in today's GPU memory.
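A minimal sketch of "one shot per GPU" in CUDA C with MPI: each rank binds to one of the GPUs in its node and processes its own subset of shots. The function name process_shot, the round-robin shot distribution and the rank placement within a node are illustrative assumptions:

#include <mpi.h>
#include <cuda_runtime.h>

extern void process_shot(int shot_id);   // RTM of a single shot (assumed)

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    // Bind this rank to one GPU of the node (ranks assumed to be
    // placed consecutively within a node).
    int ngpus = 0;
    cudaGetDeviceCount(&ngpus);
    cudaSetDevice(rank % ngpus);

    const int nshots = 1000;              // assumed survey size
    for (int s = rank; s < nshots; s += size)
        process_shot(s);                  // shots are independent

    MPI_Finalize();
    return 0;
}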

55 ©Bull, 2012 GPU Tech Conference 2012 – San Jose

BULL has built its expertise on real customer requests:
- Trainings
- Proofs of concept in Oil & Gas, Finance, Life Science, Material Science
- Advice for cluster architecture definition
- Pro-activity

BULL's expertise is recognized:
- Successful POCs with significant speed-ups and cost reductions
- Acknowledgments in scientific publications
- Help with code migration and optimization

56 ©Bull, 2012 GPU Tech Conference 2012 – San Jose
