Implementing RTM on Dense GPU Platforms

Geert Wenes – Cray, Inc.

Ty McKercher – NVIDIA

TRANSCRIPT

Page 1: Implementing RTM on Dense GPU Platforms

Implementing RTM on Dense GPU Platforms

Geert Wenes – Cray, Inc.

Ty McKercher – NVIDIA

Page 2: Implementing RTM on Dense GPU Platforms

Cray Vision: Fusion of Supercomputing and Big (Fast) Data

Copyright 2015 Cray Inc.

Modeling The World

Data-Intensive Processing

Math Models

Simulation and modeling of the natural world via mathematical equations.

Data Models

Analysis of large datasets for knowledge discovery, insight, and prediction.

Feeding scientific, sensor, and internet data into simulations

Analytic processing of simulation output

Compute Store Analyze

Page 3: Implementing RTM on Dense GPU Platforms

Chart source: Henri Calandra - Total

[Chart: HPC evolution vs. seismic algorithm complexity, 1990–2020. The #1 TOP500 system rises from 1 TF toward 1 EF while imaging algorithms advance from Kirchhoff beam, post-SDM and pre-STM through the paraxial WE approximation to the full WE approximation (acoustic & anisotropic, then elastic & visco-elastic), RTM-FWI, and L2RTM (TOTAL EP)]

Cray system + new algorithms: “Instead of thousands of years, we can now process a full FWI survey in a matter of weeks or days, depending on the amount of data and complexity of the rocks in the subsurface.” – Steve Derenthal, in The Lamp, 2012-2.

Processing: Algorithmic Complexity Increasing


Page 4: Implementing RTM on Dense GPU Platforms

Petroleum Geo-Services (PGS) Selects High End Cray XC Series Supercomputer for Seismic Processing

• One of the largest ever commercial supercomputers

• 5 PetaFlop XC40 Supercomputer Performance

• Seismic Processing and Imaging focus

– Subsurface maps and 3-D models

• PGS win based on Cray’s:

– Competitive advantage over other O&G service suppliers

• Performance - Increased processing capacity

• Throughput - Faster turn-around on seismic jobs

– Compute efficiency & reliability

– Supportive partnership

• Integrated Cray configuration includes:

– High performance XC40 configuration

– Integrated Sonexion 2000 storage system

PGS researchers’ codes scaled and performed beyond the competition

“Abel”


#12 – June 2015 “Top 500” (#1 commercial system)

Page 5: Implementing RTM on Dense GPU Platforms

Images © Schlumberger and © SeaBird Technologies

Acquisition types: Node, Wide Azimuth, Coil, Broadband

• Node: permanently installed; repeated acquisition; broader workflow use

• Wide Azimuth: larger surveys; drives algorithmic complexity

• Coil: larger surveys; more survey components; drives algorithmic complexity

• Broadband: larger surveys; drives algorithmic complexity

Acquisition: Variety, Volume & Velocity Increasing


Page 6: Implementing RTM on Dense GPU Platforms

• What Geoscientists want

• How GPUs scale

• Why the application range is broadening

• Why RTM is a natural fit for GPUs

• How you port to GPUs

• Where you can find more information

October 2015 SHPCP Technical Presentation at SEG 2015

Page 7: Implementing RTM on Dense GPU Platforms

Why GPUs are used in seismic processing


30 Hz: 7 days to process

1,000 CPU-only nodes versus 300 GPU-accelerated nodes

1000:300, a 3-to-1 productivity gain

Page 8: Implementing RTM on Dense GPU Platforms

Why GPUs are used in seismic processing


30 Hz: 7 days to process

1,000 CPU-only nodes versus 300 GPU-accelerated nodes

1000:300, a 3-to-1 productivity gain

60 Hz: weeks to process

15,000 CPU-only nodes versus 5,000 GPU-accelerated nodes

15000:5000, a 3-to-1 productivity gain

Page 9: Implementing RTM on Dense GPU Platforms


More Physics

More Scenarios

Bigger Models

Page 10: Implementing RTM on Dense GPU Platforms


30 Hz → 60 Hz: high frequency → dense sampling → large memory → multiple GPUs → halo exchange → inter-GPU communication
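The 30 Hz → 60 Hz jump can be put in numbers. A minimal sketch (illustrative only, not from the talk): halving the wavelength halves the grid spacing on each axis and, via the CFL stability condition, the time step, so memory grows as r³ and total work as r⁴, consistent with the W ~ N^4 cost cited later in the deck.

```python
# Hypothetical illustration of why doubling migration frequency
# (30 Hz -> 60 Hz) drives large memory and multiple GPUs.
def fd_cost(freq_hz, base_freq=30.0):
    """Relative grid size and work for a finite-difference propagator."""
    r = freq_hz / base_freq          # grid refinement factor per axis
    points = r ** 3                  # 3 spatial axes -> memory grows r^3
    work = r ** 4                    # plus finer time steps -> W ~ N^4
    return points, work

points, work = fd_cost(60.0)
print(points, work)                  # 8x the grid points, 16x the work
```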

Page 11: Implementing RTM on Dense GPU Platforms


GPU-0 GPU-1

Tesla K80 Tesla K80

GPU-2 GPU-3

Host

divide property volumes among multiple GPUs
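A minimal sketch of the slab decomposition the slide describes (hypothetical helper names, not the presenters' code): split one axis of the property volume into near-equal slabs, one per GPU.

```python
# Sketch: divide a volume's z-axis into near-equal slabs, one per GPU,
# as in the slide's 4-GPU (2x Tesla K80) host.
def split_volume(nz, n_gpus):
    """Return (z_start, z_stop) slab bounds for each GPU."""
    base, extra = divmod(nz, n_gpus)
    bounds, z = [], 0
    for g in range(n_gpus):
        size = base + (1 if g < extra else 0)   # spread the remainder
        bounds.append((z, z + size))
        z += size
    return bounds

print(split_volume(1000, 4))   # [(0, 250), (250, 500), (500, 750), (750, 1000)]
```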

Page 12: Implementing RTM on Dense GPU Platforms


GPU-0 GPU-1

Tesla K80 Tesla K80

GPU-2 GPU-3

P2P P2P P2P

Host

divide volumes into halo and inner-region domains
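The halo/inner split can be sketched the same way (illustrative helper, assuming a stencil of radius R, e.g. R = 4 for an 8th-order finite-difference operator; none of these names are from the talk):

```python
# Sketch: split one GPU's slab (z0, z1) into halo faces and an inner
# region for a stencil of radius `radius`. End slabs have no neighbor
# on one side, so they keep no halo there.
def halo_and_inner(z0, z1, radius, first_gpu, last_gpu):
    lo_halo = None if first_gpu else (z0, z0 + radius)
    hi_halo = None if last_gpu else (z1 - radius, z1)
    inner = (z0 + (0 if first_gpu else radius),
             z1 - (0 if last_gpu else radius))
    return lo_halo, inner, hi_halo

print(halo_and_inner(250, 500, 4, first_gpu=False, last_gpu=False))
# ((250, 254), (254, 496), (496, 500))
```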

Page 13: Implementing RTM on Dense GPU Platforms


GPU-0 GPU-1

Tesla K80 Tesla K80

GPU-2 GPU-3

P2P P2P P2P

Host

streams for halo calculations and halo data exchange

Page 14: Implementing RTM on Dense GPU Platforms


GPU-0 GPU-1

Tesla K80 Tesla K80

GPU-2 GPU-3

P2P P2P P2P

Host

streams to launch inner-region calculations

Page 15: Implementing RTM on Dense GPU Platforms


peer-to-peer exchange between GPUs

overlap halo operations and inner-region calculations

GPU-0 GPU-1

Tesla K80 Tesla K80

GPU-2 GPU-3

P2P P2P P2P

Host
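The schedule these slides build up can be summarized as a per-timestep loop. This is a plain-Python model of the launch order only, not real GPU code; an actual implementation would use CUDA streams, `cudaMemcpyPeerAsync`, and events for the synchronization.

```python
# Sketch of each GPU's per-timestep schedule, mirroring the slides:
# halo kernels run first on one stream, their results go out over P2P,
# and the (much larger) inner-region kernel runs concurrently on a
# second stream so the exchange cost is hidden.
def timestep_schedule(log):
    log.append("halo_stream: launch halo kernels")
    log.append("halo_stream: P2P halo exchange with neighbor GPUs")
    log.append("inner_stream: launch inner-region kernel (overlaps exchange)")
    log.append("sync: both streams before next timestep")
    return log

log = timestep_schedule([])
print("\n".join(log))
```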

Page 16: Implementing RTM on Dense GPU Platforms


[Diagram: dual-socket host — two CPUs linked by QPI, each with its own DRAM and two PCIe links]

Page 17: Implementing RTM on Dense GPU Platforms


[Diagram: the same dual-socket host with each CPU's PCIe links fanning out through PLX switches to four Tesla K80 cards apiece — 8 cards in all, GPUs 0–15]

Single server + 8x Tesla K80: 192 GB GPU memory, 39,936 CUDA cores, 64.8 TFLOPS (peak, fp32)
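The headline figures follow from per-card specs: a Tesla K80 carries two GK210 GPUs, each with 12 GB of memory and 2,496 CUDA cores. A quick arithmetic check:

```python
# Arithmetic behind the slide's single-server figures for 8x Tesla K80.
K80_CARDS = 8
GPUS_PER_CARD = 2        # each K80 card holds two GK210 GPUs
GB_PER_GPU = 12
CORES_PER_GPU = 2496

gpu_memory_gb = K80_CARDS * GPUS_PER_CARD * GB_PER_GPU
cuda_cores = K80_CARDS * GPUS_PER_CARD * CORES_PER_GPU
print(gpu_memory_gb, cuda_cores)   # 192 39936
```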

Page 18: Implementing RTM on Dense GPU Platforms

Performance scaling with multiple GPUs


Page 19: Implementing RTM on Dense GPU Platforms


Page 20: Implementing RTM on Dense GPU Platforms


Next-generation Pascal (2016)

• Stacked memory: peak performance, higher bandwidth

• NVLink high-speed interconnect: CPU and GPU-to-GPU interconnect

• Unified memory: single memory space

Page 21: Implementing RTM on Dense GPU Platforms

Why the application range is broadening


[Timeline: GPU seismic applications broadening — RTM (2007); KDM, KTM, WEM, PSPI (2008–2010); FWI, elastic modeling, SRME, CSEM (2012+)]

Page 22: Implementing RTM on Dense GPU Platforms

• Regular access patterns, 80% peak memory bandwidth (480 GB/s)

• Hardware-based math operations

• Communication costs hidden by overlapping computation

• Entire TTI shots per node on multi-GPU systems
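The "regular access patterns" bullet can be illustrated with a toy stencil (a textbook 2nd-order operator, not the talk's production TTI kernel): every output reads a fixed, contiguous window of neighbors, which is what lets a GPU stream RTM wavefields at a large fraction of peak memory bandwidth.

```python
# Toy 1D second-derivative stencil: the access pattern is perfectly
# regular, so neighboring threads read neighboring memory (coalesced
# on a GPU). Coefficients are the standard [1, -2, 1] scheme.
def laplacian_1d(u, dx):
    out = [0.0] * len(u)
    for i in range(1, len(u) - 1):
        out[i] = (u[i - 1] - 2.0 * u[i] + u[i + 1]) / (dx * dx)
    return out

u = [x * x for x in range(6)]     # u(x) = x^2, so u'' = 2 everywhere
print(laplacian_1d(u, 1.0))       # interior entries are 2.0
```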


Page 23: Implementing RTM on Dense GPU Platforms


Assess

Parallelize

Optimize

Deploy

Page 24: Implementing RTM on Dense GPU Platforms


Assess

Parallelize

Optimize

Deploy

Profile using familiar tools

Page 25: Implementing RTM on Dense GPU Platforms


Assess

Parallelize

Optimize

Deploy

3 ways to accelerate apps

Libraries

Directives

Languages

Page 26: Implementing RTM on Dense GPU Platforms


Assess

Parallelize

Optimize

Deploy

Guided analysis

Page 27: Implementing RTM on Dense GPU Platforms


Assess

Parallelize

Optimize

Deploy

Multi-GPU system advantage

Page 28: Implementing RTM on Dense GPU Platforms


Schlumberger, CGG, TGS-Nopec, ENI, Chevron, Petrobras, Statoil, Hess, Seismic City, Spectraseis, Acceleware, Stanford, U of Chicago, KAUST

Page 29: Implementing RTM on Dense GPU Platforms

• More physics, more scenarios, bigger models

• Linear scaling across multiple GPUs

• Multi-GPU system advantage: improve throughput & productivity


Seismic imaging workloads on GPUs

Page 30: Implementing RTM on Dense GPU Platforms

Realizing dense GPU Computation for RTM

At Scale and In Production

Page 31: Implementing RTM on Dense GPU Platforms


As a Seismic Migration Application

• More physics and features (RTM (VTI, TTI), L2RTM, eRTM)

• Implementation issues/choices
– Possible strong migration artifacts
– High computational cost (W ~ N^4)
– Imaging condition

• Implementation schemes (explicit FD, (pseudo-)spectral)

As Part of a Critical Workflow

• Preconditioned data, model building, post-image processing

• Integrated with complementary migration schemes (e.g. Kirchhoff)

• Wide range of tradeoffs
– disk/snapshots for source wavefield construction
– in-memory processing
– partial imaging, de-/re-migration
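The disk/snapshot-versus-in-memory tradeoff for source-wavefield construction is easy to size. A back-of-the-envelope sketch with illustrative grid numbers (not from the presentation):

```python
# Rough storage estimate for keeping source-wavefield snapshots used by
# the imaging condition. All sizes here are made up for illustration.
def snapshot_bytes(nx, ny, nz, timesteps, snap_every, bytes_per_sample=4):
    """Bytes needed to keep every snap_every-th wavefield snapshot."""
    per_snapshot = nx * ny * nz * bytes_per_sample   # fp32 samples
    n_snapshots = timesteps // snap_every
    return per_snapshot * n_snapshots

# A 1000^3 single-precision grid, 10,000 steps, snapshot every 10 steps:
total = snapshot_bytes(1000, 1000, 1000, 10_000, 10)
print(total / 1e12, "TB")   # 4.0 TB -- why in-memory or partial imaging helps
```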


Page 32: Implementing RTM on Dense GPU Platforms

CS-Storm

• Performance & Technology
– Performance at high resolution
– Technology longevity for many-core technologies

• Productivity
– Open software development environment
– Communication in a distributed-memory environment

• Workload & Storage Management
– Dynamic workload management tools
– Optimize data location to efficiently utilize storage

• TCO – cost-effectiveness
– Optimize power consumption, leverage green computing initiatives
– Reduce time-to-production

[Diagram: R&D / Dev / Systems / Facilities stack — algorithms, performance, technology, productivity, I/T processes & standards, WLM/utilization, storage/FS/IO, power/cooling, (remote) access; SLAs: time-to-solution, availability]


Algorithmic complexity @ ever-increasing fidelity and functionality

Data acquisition @ ever-increasing volume, velocity (and variety)

Page 33: Implementing RTM on Dense GPU Platforms


• Seismology community code, a proxy for seismic applications

• CUDA version - developed by Daniel Peter, ETH

• Data - courtesy BP & Princeton (3D elastic, isotropic model)

[Chart: SPECFEM3D speed-up vs. #GPUs (K40: 1, 2, 4, 8, 16) — strong scaling on multi-GPU CS-Storm servers for simple_model and the BP demo against ideal scaling; elastic 3D BP model with density (rho), Vp, and Vs]


Page 34: Implementing RTM on Dense GPU Platforms

SpecFEM3D linear scaling on K40 and K80


• Almost perfect scaling when adding more GPUs, and with K80 vs. K40

• K80 nodes show a clear performance advantage: ~2x the performance

[Chart: SpecFEM3D strong scaling, K40 to K80 performance improvements — wall-clock time (sec, 0–450) vs. number of GPU cards (0–18) for K40 and K80: linear scaling with the number of GPUs; ½ the run time per node using K80 nodes]


Page 35: Implementing RTM on Dense GPU Platforms

Cray Advanced Cluster Engine (ACE™)

• Complete cluster, server, network and storage management

• Extreme scalability and ease of use

• Partitioning; job scheduler support; revision system with rollback; automatic network/server discovery and failover

Cray Programming Environment on CS

• Cray Compiling Environment

• Cray Scientific and Math Libraries

• Cray Performance Measurement and Analysis Tools

Complete SW Ecosystem

• Open-source and partner tools


Page 36: Implementing RTM on Dense GPU Platforms

[Diagram: 42U rack holding three 14U cubes]

• Five CS-Storm nodes mounted vertically in a 14RU cube

• Datacenter-friendly cooling options
– Air cooled for versatility
– Support for liquid-cooled rear-door heat exchangers for room-neutral cooling

Support for 19” or 24” cabinet


Page 37: Implementing RTM on Dense GPU Platforms

Powerful and Efficient

• Uncompromising performance in a single-rack system

• Full system solution featuring Cray management and programming environment

• Maximum efficiency for scalable GPU applications

Performance by Design

• Power and cooling to spare allows GPUs to run at full power

• Designed for upgradeability to protect your investment

Cray Service and Reliability

• Redundancy, data protection and serviceability

• Cray expertise

CS-Storm delivers the best possible efficiency and performance for RTM applications
