parallel computing with gpus - rwth aachen university · 2012-09-21 · 15 matlab gpu computing...
1
MATLAB Parallel Computing with GPUs
November 2011
Jörg Krall
Sr. Business Development Manager
Professional Solutions Group
MATLAB
2
NVIDIA FACTS:
Founded in 1993
Fastest semiconductor company to reach $1 billion in revenue
FY11: $3.5 billion in revenue
6,800 employees in 20 countries
1,900 patents
Headquartered in Santa Clara, Calif.
6
CPU Pizza Delivery
Process: Delivery truck delivers one pizza and then moves to next house
Original Idea by Jedox www.jedox.com
7
NVIDIA GPU Pizza Delivery
Process: Many deliveries to many houses
Original Idea by Jedox www.jedox.com
8
CUDA Developer Community Growth
[Chart: NVIDIA GPGPU: Papers and Articles, 2005 to 2009; vertical axis 0 to 3000]
CUDA Capable GPUs: 350,000,000
CUDA Toolkit Downloads: 1,000,000
Active CUDA Developers: 150,000
Universities Teaching CUDA: 470
OEMs offering CUDA GPU PCs: 100%
9
Tesla GPUs Power Top Supercomputers
Tianhe-1A: 7168 Tesla GPUs, 2.5 PFLOPS
Nebulae: 4650 Tesla GPUs, 1.2 PFLOPS
Tsubame 2.0: 4224 Tesla GPUs, 1.194 PFLOPS
Public comments acknowledging Tianhe-1A:
“We not only created the world's fastest computer, but also implemented a heterogeneous computing architecture incorporating CPU and GPU, this is a new innovation.” Premier Wen Jiabao
10
NCSA Mixes GPUs into Blue Waters
“NCSA is excited about the inclusion of NVIDIA's Tesla GPUs in Blue Waters. GPUs provide extraordinary capabilities for numerically-intensive computations and a cost-effective, energy-efficient way to build tomorrow's petascale supercomputers.” Thom Dunning, Director, NCSA
11
Titan at Oak Ridge
World’s Top Open Science Computing Research Facility
2x Faster, 3x More Energy Efficient than Current #1 (K Computer)
18,000 Tesla GPUs
20+ PetaFlops
~90% of flops from GPUs
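As a rough sanity check on the Titan figures above, the implied peak per GPU can be worked out directly. This is only a back-of-envelope sketch: it treats "20+ PetaFlops" as exactly 20 (a lower bound) and "~90%" as exactly 0.90, and the variable names are illustrative.

```python
# Back-of-envelope estimate from the slide's headline numbers.
# Assumptions: "20+" is taken as 20.0 and "~90%" as 0.90.
gpu_count = 18_000        # Tesla GPUs quoted on the slide
total_pflops = 20.0       # system peak, lower bound
gpu_share = 0.90          # fraction of flops attributed to GPUs

# 1 PFLOPS = 1000 TFLOPS
per_gpu_tflops = total_pflops * 1000 * gpu_share / gpu_count

print(f"Implied peak per GPU: about {per_gpu_tflops:.2f} TFLOPS")
```

Under these assumptions each GPU would need to contribute roughly one teraflop of peak performance.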
12
Options for Targeting GPUs with MATLAB
Built-in MATLAB functions
User-defined MATLAB functions
User-defined CUDA kernels
[Arrow labels: Ease of Use; Greater Control]
14
Benchmark: Solving 2D Wave Equation CPU vs. GPU
Intel Xeon Processor X5650, NVIDIA Tesla C2050 GPU
Grid Size | CPU (s) | GPU (s) | Speedup
64 x 64 | 0.1004 | 0.3553 | 0.28
128 x 128 | 0.1931 | 0.3368 | 0.57
256 x 256 | 0.5888 | 0.4217 | 1.4
512 x 512 | 2.8163 | 0.8243 | 3.4
1024 x 1024 | 13.4797 | 2.4979 | 5.4
2048 x 2048 | 74.9904 | 9.9567 | 7.5
* Note: data displayed on log scale
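The Speedup column appears to be the CPU time divided by the GPU time; the slide does not state this, but the tabulated values are consistent with it. The short sketch below recomputes the ratios from the timings in the table and finds the smallest grid at which the GPU run is faster. The names `timings`, `speedup`, and `break_even` are illustrative only.

```python
# Timings in seconds, copied from the benchmark table above.
# Key: grid edge length N for an N x N grid; value: (CPU time, GPU time).
timings = {
    64: (0.1004, 0.3553),
    128: (0.1931, 0.3368),
    256: (0.5888, 0.4217),
    512: (2.8163, 0.8243),
    1024: (13.4797, 2.4979),
    2048: (74.9904, 9.9567),
}


def speedup(cpu_seconds, gpu_seconds):
    """Return how many times faster the GPU run was than the CPU run."""
    return cpu_seconds / gpu_seconds


# Assumed definition: speedup = CPU time / GPU time.
speedups = {n: round(speedup(c, g), 2) for n, (c, g) in timings.items()}

# Smallest grid where the GPU beats the CPU (speedup greater than 1).
break_even = min(n for n, s in speedups.items() if s > 1.0)

for n, s in speedups.items():
    print(f"{n} x {n}: {s:.2f}x")
print(f"GPU is faster from {break_even} x {break_even} upward")
```

For the two smallest grids the ratio is below 1, meaning the GPU run is slower than the CPU run; the advantage only appears once the grid is large enough.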
15
MATLAB GPU Computing Examples
4x speedup in adaptive filtering routine (part of acoustic tracking algorithm)
4x speedup in wave equation solving (part of seismic data processing algorithm)
3x speedup in estimating 7.6 million contract prices using Black-Scholes model
14x speedup in template matching routine (part of cancer cell image analysis)
10x speedup in data clustering via K-means clustering algorithm
17x speedup in simulating the movement of 3072 celestial objects
16
GPUs for MathWorks Parallel Computing Toolbox™ and Distributed Computing Server™
Workstation: MATLAB Parallel Computing Toolbox (PCT)
• PCT enables high performance through parallel computing on workstations
• NVIDIA GPU acceleration available now
Compute Cluster: MATLAB Distributed Computing Server (MDCS)
• MDCS allows a MATLAB PCT application to be submitted and run on a compute cluster
• NVIDIA GPU acceleration available now
17
Resources
MATLAB Digest Article – GPU Programming in MATLAB http://www.mathworks.com/company/newsletters/articles/gpu-programming-in-matlab.html
GPU Computing with MATLAB webinar http://www.mathworks.com/company/events/webinars/wbnr59816.html
MATLAB benchmarking examples on the GPU http://www.mathworks.com/matlabcentral/fileexchange/?term=gpu
http://www.mathworks.com/products/parallel-computing/demos.html?file=/products/demos/shipping/distcomp/paralleldemo_gpu_backslash.html
http://www.mathworks.com/products/distriben/demos.html?file=/products/demos/distribtb/MapDemo/MapDemo.html
GPUs for Parallel Computing Toolbox http://www.mathworks.com/gpu
http://www.nvidia.com/object/tesla-matlab-accelerations.html
http://www.mathworks.com/products/parallel-computing/
http://www.mathworks.com/products/datasheets/pdf/parallel-computing-toolbox.pdf
Product Trial http://www.mathworks.com/programs/trials/trial_request.html?s_cid=SA_prod_distcomp_parallel_computing_ipspot_trial&prodcode=DM,ML&eventid=673640837
18
Special MATLAB-user pricing of GPU-enabled HP workstations
GPU-enabled workstations for MATLAB users
http://www.tsa.com/promotions/promotional_details.php?id=54
19
Tesla Data Center & Workstation GPU Solutions
Servers & Blades: Tesla M-series GPUs (M2090, M2070, M2050)
Workstations: Tesla C-series GPUs (C2070, C2050)
Specification | M2090 | M2070 | M2050 | C2070 | C2050
Cores | 512 | 448 | 448 | 448 | 448
Memory | 6 GB | 6 GB | 3 GB | 6 GB | 3 GB
Memory bandwidth (ECC off) | 177.6 GB/s | 150 GB/s | 148.8 GB/s | 148.8 GB/s | 148.8 GB/s
Peak single-precision performance (Gflops) | 1331 | 1030 | 1030 | 1030 | 1030
Peak double-precision performance (Gflops) | 665 | 515 | 515 | 515 | 515
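Two comparisons can be read off the peak-performance figures above: how double precision relates to single precision on each board, and how much double-precision headroom the M2090 has over the M2070. The sketch below is only arithmetic on the slide's numbers; the dictionary layout and variable names are illustrative.

```python
# Peak performance in Gflops, copied from the Tesla specification slide.
peak_gflops = {
    "M2090": {"single": 1331, "double": 665},
    "M2070": {"single": 1030, "double": 515},
    "M2050": {"single": 1030, "double": 515},
    "C2070": {"single": 1030, "double": 515},
    "C2050": {"single": 1030, "double": 515},
}

# Ratio of double-precision to single-precision peak for each board.
dp_to_sp = {
    name: round(p["double"] / p["single"], 2) for name, p in peak_gflops.items()
}

# Double-precision peak of the M2090 relative to the M2070.
m2090_gain = round(
    peak_gflops["M2090"]["double"] / peak_gflops["M2070"]["double"], 2
)

for name, ratio in dp_to_sp.items():
    print(f"{name}: double precision is {ratio:.2f} of single precision peak")
print(f"M2090 vs M2070 double-precision peak: {m2090_gain:.2f}x")
```

On these figures every listed board delivers about half its single-precision peak in double precision.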
20
Tesla Data Center & Workstation GPU Solutions
Workstations (2 to 4 Tesla GPUs): Tesla C-series GPUs C2075, C2070, C2050
Integrated CPU-GPU Servers & Blades: Tesla M-series GPUs M2090, M2070, M2050
21
OEM Servers with Tesla M20xx GPUs
2 Tesla GPUs: SuperServer, 2 CPUs + 2 GPUs in 1U
4 Tesla GPUs: Tetra, 2 CPUs + 4 GPUs in 1U
8 Tesla GPUs: B7015, 2 CPUs + 8 GPUs in 4U
10 Tesla GPUs: GreenBlade, 10 CPUs + 10 GPUs in 5U
22
OEM Servers with Tesla M20xx GPUs
Key Systems from Global OEMs
System Name (or “codename”) | # of GPUs | # of CPU Sockets | Ratio (GPU:CPU) | Minimum rack config | Effective GPUs/1RU | Ship Date
Dell C410x | 16 | None – pairs with external host | 2:1 now, 8:1 w/ new hosts | 5U (3U for C410x + 2U for hosts) | 3.2 | Now
Dell “orca” | 2 | 2 | 1:1 | 2U | 1 | Nov 2011
Dell “Ghostrider” | 4 | 2 | 2:1 | 4U | 1 | Nov 2011
HP SL 390 “2U tray” | 3 | 2 | 3:2 | 4U | 3 | Now
HP SL 390 “4U tray” | 8 | 2 | 4:1 | 4U | 4 | Feb 2011
IBM idataplex (now) | 2 | 2 | 1:1 | 2U | 1.3 (Non-standard rack depth) | Now
IBM idataplex w/ SXM (redesign) | 4 | 2 | 2:1 | 2U | 2.6 (Non-standard rack depth) | Q4 2011
IBM Blade | 1-4 | 2 | Variable: 2:1 max | 9U | 1.2 | Now
Key Systems from Specialist OEMs
System Name (or “codename”) | # of GPUs | # of CPU Sockets | Ratio (GPU:CPU) | Minimum rack config | Effective GPUs/1RU | Ship Date
Appro Tetra | 4 | 2 | 2:1 | 1U | 4 | Now
Appro Hydra | 8 | 2 | 4:1 | 2U | 4 | Feb 2011
Bull Blade | 2 | 2 | 2:1 | 7U | 2.6 | Now
Cray XE6 GPU blade | 2 | 2 | 1:1 | Full rack | 2.2 (custom cabinet) | Q2 2011
NextIO vCORE 2070 | 4 | N/A | Same as S2050 | 2U (1U for host) | 2 | Now
Supermicro 1U | 2 | 2 | 1:1 | 1U | 2 | Now
Supermicro GPU TwinBlade | 2 | 2 | 1:1 | 7U | 2.8 | Now
Tplatforms | 2 | 2 | 1:1 | 2U | 4.6 | Now
Tyan 2U | 3 | 2 | 3:2 | 2U | 1.5 | Now
Tyan 4U | 8 | 2 | 4:1 | 4U | 2 | Now