Parallel Computing with GPUs - RWTH Aachen University

MATLAB Parallel Computing with GPUs

November 2010

NVIDIA Confidential


Joerg Krall, Sr. Business Development Manager PSG


Leading the Visual Computing Revolution

World leader in programmable graphics processor technologies

One of the world’s largest semiconductor companies

5,700 Employees World Wide

$1B Annual R&D Investment

GeForce: Amaze

Quadro: Design

Tegra: Anywhere

Tesla: Explore

NVIDIA Tesla: Infinite Possibilities in High Performance Computing

Data center products (M and S series): passive heatsink

Single-user workstation (C series): active heatsink

Dawning Nebulae

Second Fastest Supercomputer in the World

1.27 Petaflop

4640 Tesla GPUs


Wait for announcement tonight!!!


2x Performance / Watt

[Figure: power (megawatts) versus Linpack performance (teraflops) for Jaguar (x86 CPU), Roadrunner (Cell), JUGENE (BlueGene), Nebulae (Tesla GPU), and IPE, CAS (Tesla GPU).]

Scaling to 5 PetaFlop Cluster

[Figure: power (megawatts) versus Linpack performance (teraflops), extended to 5 petaflops, with Jaguar (x86 CPU), Roadrunner (Cell), JUGENE (BlueGene), and Nebulae (Tesla GPU). Chart labels: 20 MW for x86 CPU, 10 MW for Tesla GPU.]

8x Higher Linpack

Metric                              CPU Server    GPU-CPU Server
Performance (Gflops)                80.1          656.1
Performance / $ (Gflops / $K)       11            60
Performance / watt (Gflops / kW)    146           656

CPU 1U Server: 2x Intel Xeon X5550 (Nehalem) 2.66 GHz, 48 GB memory, $7K, 0.55 kW
GPU-CPU 1U Server: 2x Tesla C2050 + 2x Intel Xeon X5550, 48 GB memory, $11K, 1.0 kW
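The per-dollar and per-watt figures on this slide can be reproduced from the Linpack results and the server price and power numbers quoted above. The following is a minimal Python sketch of that arithmetic, assuming the charts plot plain quotients of the listed values; the dictionary layout and the helper names `efficiency` and `speedup` are illustrative and do not come from the slides.

```python
# Linpack results and server specs as quoted on the slide.
servers = {
    "CPU Server": {"gflops": 80.1, "price_k_usd": 7.0, "power_kw": 0.55},
    "GPU-CPU Server": {"gflops": 656.1, "price_k_usd": 11.0, "power_kw": 1.0},
}


def efficiency(spec):
    """Return (Gflops per $K, Gflops per kW) for one server (illustrative helper)."""
    return spec["gflops"] / spec["price_k_usd"], spec["gflops"] / spec["power_kw"]


def speedup(fast, slow):
    """Return the ratio of Linpack performance between two servers."""
    return fast["gflops"] / slow["gflops"]


if __name__ == "__main__":
    for name, spec in servers.items():
        per_dollar, per_kw = efficiency(spec)
        print(f"{name}: {per_dollar:.1f} Gflops/$K, {per_kw:.1f} Gflops/kW")
    ratio = speedup(servers["GPU-CPU Server"], servers["CPU Server"])
    print(f"Linpack speedup: {ratio:.1f}x")
```

Rounded to whole numbers, these quotients agree with the slide's 11 vs. 60 Gflops/$K, 146 vs. 656 Gflops/kW, and the "8x" headline.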

GPU Servers Go Mainstream

Tesla S870: Dec 2007

Tesla S1070 / M1060: 2008-2009

Tesla M2050 / M2070: 2010

OEM Servers with Tesla M2050 GPUs

SuperServer: 2 CPUs + 2 GPUs in 1U

Tetra: 2 CPUs + 4 GPUs in 1U

GreenBlade: 10 CPUs + 10 GPUs in 5U

B7015: 2 CPUs + 8 GPUs in 4U

Announced OEM Servers w/ Tesla M-series GPUs


The Tesla Visual Supercomputer: Return of the Scientific Workstation

4 TeraFlops Workstation
• 4 CUDA GPUs
• 1792 cores
• 24 GB fast GPU memory

Specs:
• Quad-core CPU (1P or 2P)
• 16 GB system memory
• 4 Tesla/Quadro GPUs

Optimized for scientific computing

The power of a cluster in a workstation

Teaching CUDA

NVIDIA Developer Eco-System

Debuggers & Profilers: cuda-gdb, NV Visual Profiler, Parallel Nsight, Visual Studio, Allinea, TotalView

Numerical Packages: MATLAB, Mathematica, NI LabView, pyCUDA

GPU Compilers: C, C++, Fortran, OpenCL, DirectCompute, Java, Python

Parallelizing Compilers: PGI Accelerator, CAPS HMPP, mCUDA, OpenMP, PGI CUDA x86

Libraries: BLAS, FFT, LAPACK, NPP, Video, Imaging, GPULib

OEM Solution Providers

GPGPU Consultants & Training: ANEO, GPU Tech

NVIDIA GPU Acceleration for MATLAB

Partnership
• GPU support NOW for MATLAB
• NVIDIA is the exclusive GPU partner
• Double precision required (i.e. Tesla 10 series and later)
• Developed by The MathWorks using CUDA C

Status
• Released: http://www.mathworks.com/discovery/matlab-gpu.html
• Supported in Release 2010b with Parallel Computing Toolbox (PCT) and MATLAB Distributed Computing Server (MDCS)

“Everyone that comes in as a new hire already knows MATLAB… The learning curve is significantly lessened as a result.”

Jeff Corn, Chief of Engineering Projects Section, U.S. Air Force

MATLAB makes GPUs more accessible

MATLAB Benefits
• Faster time to discovery
• Empowers scientist / practitioner
• No need for programming expertise
• No custom tools
• Automated application deployment

[Diagram: language integration (CUDA C / C++) serves the developer / computer scientist with computational expertise; high-level technical computing languages (1 million+ MATLAB licensees) serve the scientist / practitioner with domain expertise.]

GPUs for MathWorks Parallel Computing Toolbox™ and Distributed Computing Server™

Workstation: MATLAB Parallel Computing Toolbox (PCT)
• PCT enables high performance through parallel computing on workstations
• NVIDIA GPU acceleration now available

Compute Cluster: MATLAB Distributed Computing Server (MDCS)
• MDCS allows a MATLAB PCT application to be submitted and run on a compute cluster
• NVIDIA GPU acceleration now available

MATLAB Performance with Tesla

[Figure: MATLAB® mldivide performance, matrix left division (A\b), Tesla C2050 vs. Core 2 Quad Q6600.]

http://www.mathworks.com/products/parallel-computing/demos.html?file=/products/demos/shipping/distcomp/paralleldemo_gpu_backslash.html


[Figure: Relative Performance, Point-in-Polygon Demo, compared to single-core CPU baseline. Relative execution speed versus input size (1,024 to 65,536) for single-core CPU, quad-core CPU, single-core CPU + Tesla C1060, and quad-core CPU + Tesla C1060.]

Core 2 Quad Q6600 2.4 GHz, 6 GB RAM, Windows 7 64-bit, Tesla C1060, single precision operations
http://www.mathworks.com/products/distriben/demos.html?file=/products/demos/distribtb/MapDemo/MapDemo.html


[Figure: Relative Performance, Black-Scholes Demo, compared to single-core CPU baseline. Relative execution speed versus input size (256 K to 16,384 K) for single-core CPU, quad-core CPU, single-core CPU + Tesla C1060, and quad-core CPU + Tesla C1060.]

Core 2 Quad Q6600 2.4 GHz, 6 GB RAM, Windows 7 64-bit, Tesla C1060, single precision operations

Tesla 20-Series Double Precision Throughput

[Figure: GFLOP/s throughput for Tesla vs. GeForce, measured performance. GeForce GTX 480 and Tesla C2050 compared on multiply-add (DMAD), multiply (DMUL), and add (DADD).]

Core i7-920 2.66 GHz, 6 GB RAM, Windows 7 64-bit, Tesla C1060 (ECC enabled), double precision operations

Summary of Options for Targeting GPUs

Across one or more GPUs on one or more machines, ordered from greatest ease of use to greatest control:

1) Use GPU array interface with MATLAB built-in functions

2) Execute custom functions on elements of the GPU array

3) Create kernels from existing CUDA code and PTX files

What hardware is supported?

• NVIDIA hardware meeting the CUDA 1.3 hardware spec.
• A listing can be found at: http://www.nvidia.com/object/cuda_gpus.html

How come function_xyz is not GPU-accelerated?

• The accelerated functions available in this first release were gated by available resources.
• We will add capabilities with coming releases based on requirements and feedback.

Why did we adopt CUDA and not OpenCL?

• CUDA has the only ecosystem with all of the libraries necessary for technical computing

Why are CUDA 1.1 and CUDA 1.2 not supported?

As mentioned earlier, CUDA 1.3 offers the following capabilities that earlier releases of CUDA do not:

– Support for doubles. The base data type in MATLAB is double.

– IEEE compliance. We want to ensure we get the correct answer.

– Cross-platform support.

What benchmarks are available?

• Benchmarks are available in the product and at www.mathworks.com/gpu

NVIDIA Tesla GPU Computing Products

Product                      Form factor         GPUs          Single precision   Double precision   Memory                Mem BW
Tesla M2070 / Tesla M2050    Server module       1 T20 GPU     1030 GFlops        515 GFlops         6 GB / 3 GB           148.4 GB/s
Tesla M1060                  Server module       1 T10 GPU     933 GFlops         78 GFlops          4 GB                  102 GB/s
Tesla S2050                  1U system           4 T20 GPUs    4120 GFlops        2060 GFlops        12 GB                 148.4 GB/s
Tesla S1070                  1U system           4 T10 GPUs    4140 GFlops        346 GFlops         16 GB (4 GB / GPU)    102 GB/s
Tesla C2070 / Tesla C2050    Workstation board   1 T20 GPU     1030 GFlops        515 GFlops         6 GB / 3 GB           144 GB/s
Tesla C1060                  Workstation board   1 T10 GPU     933 GFlops         78 GFlops          4 GB                  102 GB/s
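The single- and double-precision rows above imply very different double-precision ratios for the T10 and T20 generations. The following is a short Python sketch of that arithmetic, using only the peak figures listed in the table; the dictionary layout and the helper names `dp_to_sp_ratio` and `generation_gain` are illustrative assumptions, not part of the slides.

```python
# Peak throughput in GFlops as listed in the table: (single precision, double precision).
peak_gflops = {
    "Tesla M2070 / M2050 (1 T20 GPU)": (1030, 515),
    "Tesla M1060 (1 T10 GPU)": (933, 78),
    "Tesla S2050 (4 T20 GPUs)": (4120, 2060),
    "Tesla S1070 (4 T10 GPUs)": (4140, 346),
}


def dp_to_sp_ratio(single, double):
    """Fraction of single-precision peak that is available in double precision."""
    return double / single


def generation_gain(new_double, old_double):
    """Ratio of double-precision peak between two products."""
    return new_double / old_double


if __name__ == "__main__":
    for name, (sp, dp) in peak_gflops.items():
        print(f"{name}: DP/SP = {dp_to_sp_ratio(sp, dp):.2f}")
    gain = generation_gain(515, 78)
    print(f"T20 vs. T10 double precision per GPU: {gain:.1f}x")
```

On these figures the T20 products reach half of their single-precision peak in double precision, against roughly one twelfth for T10, which matters because the base data type in MATLAB is double.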

What to buy: Recommended Workstation Configurations

Power User
• One or two 4/6 core CPUs
• Two Tesla C2050 GPUs
• Quadro NVS 295
• 8-12 GB RAM

Mid-Range
• One quad-core CPU
• One Tesla C2050 or C2070 GPU
• Quadro NVS 295
• 4 GB RAM

Entry
• One quad-core CPU
• One Quadro 4000 GPU
• 4 GB RAM

Tesla Benefits

Highest Computational Performance
• High-speed double precision operations
• Large dedicated memory
• High-speed bi-directional PCI-Express communications
• NVIDIA GPUDirect™ with InfiniBand

Most Reliable
• ECC memory
• Rigorous stress testing

Best Supported
• OEM system integration
• Professional support network
• Long-term product lifecycle
• 3 year warranty
• Cluster & system management tools (server products)
• Windows remote desktop support

OEM GPU Workstation Product Availability

OEM Product(s)              Maximum # of Tesla GPUs
Dell T7500                  1x
FTS R570                    2x
FTS R670                    2x
HP Z800                     2x
HP Z400                     1x
Lenovo D20                  1x (CTO)
Supermicro 7046GT-TRF       4x
Supermicro SYS-7046A-HR+    2x
Tyan FT48-B7025             4x
Tyan FT72-B7015             8x

OEM GPU Server Product Availability

OEM           Model              Model #                 GPUs                      Comments
Appro         GreenBlade         GXB100                  2x M2050                  Pairs with CPU blade
Appro         Tetra              1326G4 or 1426G4        4x M2050                  1U
Bull          Bullx blade        B505                    2x M1060                  Pairs with CPU blade
Cray          XE6                tbd                     2x M2070                  Blade for use in XE6 cabinets
Dell          PowerEdge C410x    C410x                   16x M2050 or 16x M1060    Can operate with fewer GPUs
Dell          PowerEdge m610x    m610x                   1x M2050 or 1x M1060      Pairs with CPU blade
HP            ProLiant           SL 390                  3x M2050 or M1060         4U chassis with 4 SL390 “trays”; each tray is 3 GPU / 2 CPU
IBM           Blade              tbd                     1x M2070                  Up to 4 GPU blades ‘sandwich’ with a CPU blade
IBM           iDataPlex          dx360 M3                2x M2050                  2U, half-depth
NextIO        vCORE Express      2070                    4x M2070                  1U system built from same chassis as Tesla S2050
SGI           Altix XE           XE3001                  2x M2050
Supermicro    GPU superserver    6016/1026GT-TF-FM205    2x M2050                  1026 is a 1U with more HDD bays than 6016GT
Supermicro    GPU superserver    6016GT-TF-TM2           2x M1060
T-Platforms   Tblade2            Tblade2                 2x M2070
Tyan          Tyan server        FT72-B7015-N825/625     8x/6x M2050               4U

List of TPPs

AMERICAS

"@XI" Computer Corp
ACE Computers
Advanced Clustering
Advanced HPC
AMAX
Appro (SOEM/ODM)
ASA Computers
Aspen Systems
Atipa (dba Microtech)
Colfax International
Exxact Technologies
Graphstream
Hanweck
Houston Information Team
Hypertechnologie Ciara Inc
International Computer Concepts
JRTI
KOI Computers
LUFAC
Microgeo Chile
Microway
911 Comp (formerly Wintel)
NetDirect
Padova
PCPC Direct Ltd
Penguin Computing, Inc
PSSC Labs
RAID Inc
RAVE Computer Inc
Red Barn Technology Group
Scalable Informatics
Seneca Data
SIASA
Silicon Mechanics

EMEA

Azken Muga
Boston LTD
CADNetwork
CARRI
Connoisseur Electronics
CRG Electronics Ltd
DALCO
E4 Computer Engineering SPA
E-ON
FluiDyna
Hayat BILGI STI
Hinditron
Intersystem SRL
Locuz Enterprise Solution
MEGWARE
NetWeb
New Horizon IT
Sprinx
Transtec

APAC/JAPAN

Honghutech Co., Ltd
Leaders Systems (CNS)
Miruware
Novatte Pte Ltd
Taknet Systems Pte Ltd
TSTI (Tatung System Tech. Inc)
Xenon Systems Pty Ltd

Resources

GTC session “GPU Computing with MATLAB®”, Loren Dean, MathWorks
http://developer.download.nvidia.com/compute/cuda/docs/GTC_2010_Archives.htm

GPUs for Parallel Computing Toolbox
http://www.mathworks.com/gpu
http://www.mathworks.com/products/parallel-computing/
http://www.mathworks.com/products/datasheets/pdf/parallel-computing-toolbox.pdf

Speeding Up MATLAB Computations with GPUs
http://www.mathworks.com/products/parallel-computing/description5.html

MATLAB benchmarking examples on the GPU
http://www.mathworks.com/products/parallel-computing/demos.html?file=/products/demos/shipping/distcomp/paralleldemo_gpu_backslash.html
http://www.mathworks.com/products/distriben/demos.html?file=/products/demos/distribtb/MapDemo/MapDemo.html

Product Trial
http://www.mathworks.com/programs/trials/trial_request.html?s_cid=SA_prod_distcomp_parallel_computing_ipspot_trial&prodcode=DM,ML&eventid=673640837

R2010b Press Release
http://www.mathworks.com/company/pressroom/articles/article51639.html


Thanks
