Parallel Computing with GPUs - RWTH Aachen University
MATLAB: Parallel Computing with GPUs
November 2010
NVIDIA Confidential
Joerg Krall, Sr. Business Development Manager PSG
Leading the Visual Computing Revolution
World leader in programmable graphics processor technologies
One of the world’s largest semiconductor companies
5,700 Employees World Wide
$1B Annual R&D Investment
GeForce: Amaze
Quadro: Design
Tegra: Anywhere
Tesla: Explore
NVIDIA Tesla: Infinite Possibilities in High Performance Computing
Data center products (M and S series): passive heatsink
Single-user workstation (C series): active heatsink
Dawning Nebulae
Second Fastest Supercomputer in the World
1.27 Petaflop
4640 Tesla GPUs
Wait for announcement tonight!!!
2x Performance / Watt
Figure: power (megawatts) vs. Linpack performance (teraflops) for Jaguar (x86 CPU), Roadrunner (Cell), JUGENE (BlueGene), Nebulae (Tesla GPU), and IPE, CAS (Tesla GPU).
Scaling to 5 PetaFlop Cluster
Figure: power (megawatts) vs. Linpack performance (teraflops) for Jaguar (x86 CPU), Roadrunner (Cell), JUGENE (BlueGene), and Nebulae (Tesla GPU), extended to a 5 petaflop cluster: 20 MW with x86 CPUs vs. 10 MW with Tesla GPUs.
8x Higher Linpack
Metric | CPU Server | GPU-CPU Server
Performance (Gflops) | 80.1 | 656.1
Performance / $ (Gflops / $K) | 11 | 60
Performance / watt (Gflops / kwatt) | 146 | 656
CPU 1U Server: 2x Intel Xeon X5550 (Nehalem) 2.66 GHz, 48 GB memory, $7K, 0.55 kW
GPU-CPU 1U Server: 2x Tesla C2050 + 2x Intel Xeon X5550, 48 GB memory, $11K, 1.0 kW
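The per-dollar and per-watt figures on this slide can be checked from the server prices and power draws quoted in the footnote. The sketch below is only an arithmetic check. It assumes that 80.1 Gflops is the CPU server's Linpack result and 656.1 Gflops is the GPU-CPU server's, which is how the chart values appear to pair up. The dictionary keys and the `efficiency` helper are names made up for illustration.

```python
# Figures quoted on the slide for the two 1U server configurations.
# Assumption: 80.1 Gflops belongs to the CPU server, 656.1 to the GPU-CPU server.
servers = {
    "CPU Server": {"gflops": 80.1, "price_k_usd": 7.0, "power_kw": 0.55},
    "GPU-CPU Server": {"gflops": 656.1, "price_k_usd": 11.0, "power_kw": 1.0},
}


def efficiency(spec):
    """Return (Gflops per $K, Gflops per kW) for one server."""
    return spec["gflops"] / spec["price_k_usd"], spec["gflops"] / spec["power_kw"]


for name, spec in servers.items():
    per_k_usd, per_kw = efficiency(spec)
    print(f"{name}: {per_k_usd:.1f} Gflops/$K, {per_kw:.1f} Gflops/kW")

# Ratio of the two Linpack results, the basis of the "8x" headline.
speedup = servers["GPU-CPU Server"]["gflops"] / servers["CPU Server"]["gflops"]
print(f"Linpack ratio: {speedup:.1f}x")
```

Rounded to whole numbers, the results match the chart labels (11 vs. 60 Gflops/$K and 146 vs. 656 Gflops/kW), and the Linpack ratio is slightly above 8.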
GPU Servers Go Mainstream
Tesla S870: Dec 2007
Tesla S1070 / M1060: 2008-2009
Tesla M2050 / M2070: 2010
OEM Servers with Tesla M2050 GPUs
SuperServer: 2 CPUs + 2 GPUs in 1U
Tetra: 2 CPUs + 4 GPUs in 1U
GreenBlade: 10 CPUs + 10 GPUs in 5U
B7015: 2 CPUs + 8 GPUs in 4U
Announced OEM Servers w/ Tesla M-series GPUs
The Tesla Visual Supercomputer: Return of the Scientific Workstation
4 TeraFlops Workstation
• 4 CUDA GPUs
• 1792 cores
• 24 GB fast GPU memory
Specs:
• Quad-core CPU (1P or 2P)
• 16 GB system memory
• 4 Tesla/Quadro GPUs
Optimized for scientific computing
The power of a cluster in a workstation
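The headline workstation numbers can be broken down per GPU with a few lines of arithmetic. This is only a sketch. The core count, memory size, and GPU count come from the slide, while the 1030 GFlops single-precision peak per GPU is an assumption borrowed from the T20 entries in the product table later in the deck.

```python
gpus = 4                    # from the slide
total_cores = 1792          # from the slide
total_gpu_memory_gb = 24    # from the slide
# Assumption: per-GPU single-precision peak taken from the T20 product table.
sp_gflops_per_gpu = 1030

cores_per_gpu = total_cores // gpus
memory_per_gpu_gb = total_gpu_memory_gb / gpus
total_sp_tflops = gpus * sp_gflops_per_gpu / 1000

print(f"{cores_per_gpu} cores and {memory_per_gpu_gb:.0f} GB per GPU")
print(f"{total_sp_tflops:.2f} TFlops single precision across {gpus} GPUs")
```

Under that assumption the four GPUs give about 4.1 TFlops single precision, in line with the "4 TeraFlops" label.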
Teaching CUDA
NVIDIA Developer Eco-System
• Debuggers & Profilers: cuda-gdb, NV Visual Profiler, Parallel Nsight, Visual Studio, Allinea, TotalView
• Numerical Packages: MATLAB, Mathematica, NI LabView, pyCUDA
• GPU Compilers: C, C++, Fortran, OpenCL, DirectCompute, Java, Python
• Parallelizing Compilers: PGI Accelerator, CAPS HMPP, mCUDA, OpenMP, PGI CUDA x86
• Libraries: BLAS, FFT, LAPACK, NPP, Video, Imaging, GPULib
• OEM Solution Providers
• GPGPU Consultants & Training: ANEO, GPU Tech
NVIDIA GPU Acceleration for MATLAB
Partnership
• GPU support NOW for MATLAB
• NVIDIA is the exclusive GPU partner
• Double precision required (i.e. Tesla 10 series and later)
• Developed by The MathWorks using CUDA C
Status
• Released: http://www.mathworks.com/discovery/matlab-gpu.html
• Supported in Release 2010b with Parallel Computing Toolbox (PCT) and MATLAB Distributed Computing Server (MDCS)
“Everyone that comes in as a new hire already knows MATLAB… The learning curve is significantly lessened as a result.”
Jeff Corn, Chief of Engineering Projects Section, U.S. Air Force
MATLAB makes GPUs more accessible
MATLAB Benefits
• Faster time to discovery
• Empowers scientist / practitioner
• No need for programming expertise
• No custom tools
• Automated application deployment
Language Integration
High-Level Technical Computing Languages
Developer / Computer Scientist (computational expertise): CUDA C / C++
Scientist / Practitioner (domain expertise): 1 million+ MATLAB licensees
GPUs for MathWorks Parallel Computing Toolbox™ and Distributed Computing Server™
Workstation: MATLAB Parallel Computing Toolbox (PCT)
• PCT enables high performance through parallel computing on workstations
• NVIDIA GPU acceleration now available
Compute Cluster: MATLAB Distributed Computing Server (MDCS)
• MDCS allows a MATLAB PCT application to be submitted and run on a compute cluster
• NVIDIA GPU acceleration now available
MATLAB Performance with Tesla
MATLAB® mldivide Performance: matrix left division (A\b), Tesla C2050 vs. Core 2 Quad Q6600
http://www.mathworks.com/products/parallel-computing/demos.html?file=/products/demos/shipping/distcomp/paralleldemo_gpu_backslash.html
MATLAB Performance with Tesla
Figure: Relative Performance, Point-in-Polygon Demo, compared to single-core CPU baseline. Relative execution speed vs. input size (1,024 to 65,536) for single-core CPU, quad-core CPU, single-core CPU + Tesla C1060, and quad-core CPU + Tesla C1060.
Core 2 Quad Q6600 2.4 GHz, 6 GB RAM, Windows 7 64-bit, Tesla C1060, single precision operations
http://www.mathworks.com/products/distriben/demos.html?file=/products/demos/distribtb/MapDemo/MapDemo.html
MATLAB Performance with Tesla
Figure: Relative Performance, Black-Scholes Demo, compared to single-core CPU baseline. Relative execution speed vs. input size (256 K to 16,384 K) for single-core CPU, quad-core CPU, single-core CPU + Tesla C1060, and quad-core CPU + Tesla C1060.
Core 2 Quad Q6600 2.4 GHz, 6 GB RAM, Windows 7 64-bit, Tesla C1060, single precision operations
Tesla 20-Series Double Precision Throughput
Figure: GFLOP/s Throughput for Tesla vs. GeForce (measured performance). Throughput in GFLOP/s for GeForce GTX 480 and Tesla C2050 on Multiply-Add (DMAD), Multiply (DMUL), and Add (DADD).
Core i7-920 2.66 GHz, 6 GB RAM, Windows 7 64-bit, Tesla C1060 (ECC enabled), double precision operations
Summary of Options for Targeting GPUs
Across one or more GPUs on one or more machines (ordered from ease of use to greater control):
1) Use GPU array interface with MATLAB built-in functions
2) Execute custom functions on elements of the GPU array
3) Create kernels from existing CUDA code and PTX files
What hardware is supported?
• NVIDIA hardware meeting the CUDA 1.3 hardware spec.
• A listing can be found at: http://www.nvidia.com/object/cuda_gpus.html
How come function_xyz is not GPU-accelerated?
• The accelerated functions available in this first release were gated by available resources.
• We will add capabilities with coming releases based on requirements and feedback.
Why did we adopt CUDA and not OpenCL?
• CUDA has the only ecosystem with all of the libraries necessary for technical computing
Why are CUDA 1.1 and CUDA 1.2 not supported?
As mentioned earlier, CUDA 1.3 offers the following capabilities that earlier releases of CUDA do not:
– Support for doubles. The base data type in MATLAB is double.
– IEEE compliance. We want to ensure we get the correct answer.
– Cross-platform support.
What benchmarks are available?
• Benchmarks are available in the product and at www.mathworks.com/gpu
NVIDIA Tesla GPU Computing Products
Form factor | Server Module | Server Module | 1U System | 1U System | Workstation Board | Workstation Board
Product | Tesla M2070 / Tesla M2050 | Tesla M1060 | Tesla S2050 | Tesla S1070 | Tesla C2070 / Tesla C2050 | Tesla C1060
GPUs | 1 T20 GPU | 1 T10 GPU | 4 T20 GPUs | 4 T10 GPUs | 1 T20 GPU | 1 T10 GPU
Single Precision | 1030 GFlops | 933 GFlops | 4120 GFlops | 4140 GFlops | 1030 GFlops | 933 GFlops
Double Precision | 515 GFlops | 78 GFlops | 2060 GFlops | 346 GFlops | 515 GFlops | 78 GFlops
Memory | 6 GB / 3 GB | 4 GB | 12 GB | 16 GB (4 GB / GPU) | 6 GB / 3 GB | 4 GB
Mem BW | 148.4 GB/s | 102 GB/s | 148.4 GB/s | 102 GB/s | 144 GB/s | 102 GB/s
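Since the MATLAB GPU support described in this deck requires double precision, it can help to see how much of each product's single-precision peak is available in double precision. The sketch below uses the peak GFlops figures listed above. The pairing of values to products assumes the six figures in each row follow the product order M2070/M2050, M1060, S2050, S1070, C2070/C2050, C1060. `dp_fraction` is a name made up for illustration.

```python
# Peak throughput in GFlops as listed above: (single precision, double precision).
# Assumption: values map to products in the order the products are listed.
peak_gflops = {
    "Tesla M2070 / M2050": (1030, 515),
    "Tesla M1060": (933, 78),
    "Tesla S2050": (4120, 2060),
    "Tesla S1070": (4140, 346),
    "Tesla C2070 / C2050": (1030, 515),
    "Tesla C1060": (933, 78),
}


def dp_fraction(single, double):
    """Fraction of single-precision peak that is available in double precision."""
    return double / single


for product, (sp, dp) in peak_gflops.items():
    print(f"{product}: double precision is {dp_fraction(sp, dp):.1%} of single precision")

# Generation-over-generation gain in double-precision peak for a single GPU.
t20_over_t10 = 515 / 78
print(f"T20 vs. T10 double-precision peak: {t20_over_t10:.1f}x")
```

On these figures the T20-based products deliver half of their single-precision peak in double precision, while the T10-based products deliver under a tenth.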
What to buy: Workstation Recommended Configurations
Power User
• One or two 4/6 core CPUs
• Two Tesla C2050 GPUs
• Quadro NVS 295
• 8-12 GB RAM
Mid-Range
• One quad-core CPU
• One Tesla C2050 or C2070 GPU
• Quadro NVS 295
• 4 GB RAM
Entry
• One quad-core CPU
• One Quadro 4000 GPU
• 4 GB RAM
Tesla Benefits
Highest Computational Performance
• High-speed double precision operations
• Large dedicated memory
• High-speed bi-directional PCI-Express communications
• NVIDIA GPUDirect™ with InfiniBand
Most Reliable
• ECC memory
• Rigorous stress testing
Best Supported
• OEM system integration
• Professional support network
• Long-term product lifecycle
• 3 year warranty
• Cluster & system management tools (server products)
• Windows remote desktop support
OEM GPU Workstation Product Availability
OEM Product(s) | Maximum # of Tesla GPUs
Dell T7500 | 1x
FTS R570 | 2x
FTS R670 | 2x
HP Z800 | 2x
HP Z400 | 1x
Lenovo D20 (CTO) | 1x
Supermicro 7046GT-TRF | 4x
Supermicro SYS-7046A-HR+ | 2x
Tyan FT48-B7025 | 4x
Tyan FT72-B7015 | 8x
OEM GPU Server Product Availability
OEM | Model | Model # | GPUs | Comments
Appro | GreenBlade | GXB100 | 2x M2050 | Pairs with CPU blade
Appro | Tetra | 1326G4 or 1426G4 | 4x M2050 | 1U
Bull | Bullx blade | B505 | 2x M1060 | Pairs with CPU blade
Cray | XE6 | tbd | 2x M2070 | Blade for use in XE6 cabinets
Dell | PowerEdge C410x | C410x | 16x M2050 or 16x M1060 | Can operate with fewer GPUs
Dell | PowerEdge m610x | m610x | 1x M2050 or 1x M1060 | Pairs with CPU blade
HP | ProLiant | SL 390 | 3x M2050 or M1060 | 4U chassis with 4 SL390 “trays”; each tray is 3 GPU / 2 CPU
IBM | Blade | Tbd | 1x M2070 | Up to 4 GPU blades ‘sandwich’ with a CPU blade
IBM | iDataPlex | dx360 M3 | 2x M2050 | 2U, half-depth
NextIO | vCORE Express | 2070 | 4x M2070 | 1U system built from same chassis as Tesla S2050
SGI | Altix XE | XE3001 | 2x M2050 |
Supermicro | GPU superserver | 6016/1026GT-TF-FM205 | 2x M2050 | 1026 is a 1U with more HDD bays than 6016GT
Supermicro | GPU superserver | 6016GT-TF-TM2 | 2x M1060 |
T-Platforms | Tblade2 | Tblade2 | 2x M2070 |
Tyan | Tyan server | FT72-B7015-N825/625 | 8x/6x M2050 | 4U
List of TPPs
AMERICAS: "@XI" Computer Corp; ACE Computers; Advanced Clustering; Advanced HPC; AMAX; Appro (SOEM/ODM); ASA Computers; Aspen Systems; Atipa (dba Microtech); Colfax International; Exxact Technologies; Graphstream; Hanweck; Houston Information Team; Hypertechnologie Ciara Inc; International Computer Concepts; JRTI; KOI Computers; LUFAC; Microgeo Chile; Microway; 911 Comp (formerly Wintel); NetDirect; Padova; PCPC Direct Ltd; Penguin Computing, Inc; PSSC Labs; RAID Inc; RAVE Computer Inc; Red Barn Technology Group; Scalable Informatics; Seneca Data; SIASA; Silicon Mechanics
EMEA: Azken Muga; Boston LTD; CADNetwork; CARRI; Connoiseur Electonics; CRG Electronics Ltd; DALCO; E4 Computer Engineering SPA; E-ON; FluiDyna; Hayat BILGI STI; Hinditron; Intersystem SRL; Locuz Enterprise Solution; MEGWARE; NetWeb; New Horizon IT; Sprinx; Transtec
APAC/JAPAN: Honghutech Co., Ltd; Leaders Systems (CNS); Miruware; Novatte Pte Ltd; Taknet Systems Pte Ltd; TSTI (Tatung System Tech. Inc); Xenon Systems Pty Ltd
Resources
GTC session “GPU Computing with MATLAB®”, Loren Dean, MathWorks
http://developer.download.nvidia.com/compute/cuda/docs/GTC_2010_Archives.htm
GPUs for Parallel Computing Toolbox
http://www.mathworks.com/gpu
http://www.mathworks.com/products/parallel-computing/
http://www.mathworks.com/products/datasheets/pdf/parallel-computing-toolbox.pdf
Speeding Up MATLAB Computations with GPUs
http://www.mathworks.com/products/parallel-computing/description5.html
MATLAB benchmarking examples on the GPU
http://www.mathworks.com/products/parallel-computing/demos.html?file=/products/demos/shipping/distcomp/paralleldemo_gpu_backslash.html
http://www.mathworks.com/products/distriben/demos.html?file=/products/demos/distribtb/MapDemo/MapDemo.html
Product Trial
http://www.mathworks.com/programs/trials/trial_request.html?s_cid=SA_prod_distcomp_parallel_computing_ipspot_trial&prodcode=DM,ML&eventid=673640837
R2010b Press Release
http://www.mathworks.com/company/pressroom/articles/article51639.html