Introduction to Parallel Computing Using CUDA
Ken Domino, Domem Technologies, May 2, 2011
IEEE Boston Continuing Education Program
Time and Location: 6:00 - 8:00 PM, Mondays, May 2, 9, 16, 23
Course Website: http://domemtech.com/ieee-pp
Instructor: Ken Domino, [email protected]
About this course
Recommended Textbooks:
About this course
CUDA by Example: An Introduction to General-Purpose GPU Programming, by J. Sanders and E. Kandrot, ©2010, ISBN 9780131387683
Programming Massively Parallel Processors: A Hands-on Approach, by D. Kirk and W.-m. Hwu, © 2010, ISBN 9780123814722
Recommended Textbooks:
About this course
Principles of Parallel Programming, by Calvin Lin and Larry Snyder, © 2008, ISBN 9780321487902
Introduction to Parallel Algorithms, by C. Xavier and S. Iyengar, © 1998, ISBN 9780471251828
Patterns for Parallel Programming, by Timothy G. Mattson, Beverly A. Sanders, and Berna L. Massingill, © 2004, ISBN 9780321228116
Other material
Original research papers (see reference list)
About this course
Uzi Vishkin, http://www.umiacs.umd.edu/~vishkin/index.shtml. Class notes on Thinking in Parallel: Some Basic Data-Parallel Algorithms and Techniques, 2010, http://www.umiacs.umd.edu/~vishkin/PUBLICATIONS/classnotes.pdf
Why is Parallel Computing Important?
CPUs had been getting faster… but that stopped in the mid-2000s. Why?
Why is Parallel Computing Important?
Pollack FJ. New microarchitecture challenges in the coming generations of CMOS process technologies (keynote address)(abstract only). Proceedings of the 32nd annual ACM/IEEE international symposium on Microarchitecture. Haifa, Israel: IEEE Computer Society; 1999:2.
Problems can be solved much faster.
Why is Parallel Computing Important?
Predictive protein binding: “Meet Tanuki, a 10,000-core Supercomputer in the Cloud”
Based on the Amazon EC2 cloud service; the client was Genentech
Compute time reduced from about a month to 8 hours
http://www.bio-itworld.com/news/04/25/2011/Meet-Tanuki-ten-thousand-core-supercomputer-in-cloud.html
Computer vision
Why is Parallel Computing Important?
OpenVIDIA: Parallel GPU Computer Vision
Solves problems in segmentation, stereo vision, optical flow, and feature tracking
http://openvidia.sourceforge.net/index.php/OpenVIDIA
http://psychology.wikia.com/wiki/Computer_vision
Army: “Want computers to work like the human brain”
Why is Parallel Computing Important?
http://www.wired.com/dangerroom/2011/04/army-wants-a-computer-that-acts-like-a-brain/
Where are we going?
Why is Parallel Computing Important?
o NVIDIA’s Fermi-based GeForce GTX 590 has 1024 CUDA cores (2011), is programmable using CUDA, ~2500 GFLOPS.
o In 2005, Intel started manufacturing dual-core CPUs.
o In 2010, Intel and AMD were manufacturing six-core CPUs, ~11 GFLOPS (non-SSE).
o In 2012, Intel will introduce Knights Corner, a 50-core processor.
CPU vs GPU
The largest supercomputer is the Tianhe-1A (Nov 2010, http://www.top500.org/)
14,336 Xeon X5670 6-core processors
7168 Nvidia M2050 GPU processors with 448 CUDA Cores
Why CUDA and GPUs?
Wikipedia.org. Tianhe-I, 2010.
One of the “seven” up-and-coming languages [Wayner 2010]
Brings parallel computing to the common man.
For one GPU, speed ups of 100 times or more over a serial CPU solution are common.
Used in many different applications.
Coming to mobile devices.
Why CUDA and GPUs?
A task is a sequence of instructions that executes as a group.
Tasks continue until halt, exit, or return.
Task
Computers do not directly execute tasks. Computers execute instructions, which are used to model a task.
Task
Execution of tasks is not concurrent and not simultaneous.
A sequence of tasks is called a thread.
Sequential
[Diagram: a single thread executing Step 1 → Step 2 → Step 3]
Execution of tasks of multiple threads is concurrent, but not necessarily simultaneous.
Concurrent
[Diagram: two threads, each executing Step 1 → Step 2 → Step 3, interleaved on one machine]
Execution of tasks of multiple threads is concurrent and simultaneous, executing on multiple machines.
The goal is to minimize time and work.
Parallel
[Diagram: two threads, each executing Step 1 → Step 2 → Step 3, running simultaneously on separate machines]
Nowadays, many people use the terms interchangeably [Lin and Snyder 2009]. Why?
Because the tasks of the threads can occur in any order, behavior is unpredictable.
Concurrent vs Parallel
Thread 1: Read x; Set x = y; Set y = x + 1
Thread 2: Read x; Set x = y; Set y = x + 2
Thread 3: Read x; Set x = y; Set y = x + 3
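The unpredictability can be demonstrated by exhaustively enumerating interleavings. Below is a minimal Python sketch for two of the threads above; the initial values x = 0, y = 10 and the two-thread restriction are assumptions for illustration, not from the slides.

```python
def interleavings(a, b, prefix=()):
    # All orderings of the steps of threads a and b that preserve
    # each thread's own program order.
    if not a and not b:
        yield prefix
        return
    if a:
        yield from interleavings(a[1:], b, prefix + (a[0],))
    if b:
        yield from interleavings(a, b[1:], prefix + (b[0],))

def make_thread(k):
    # The slide's three-step thread: Read x; Set x = y; Set y = x + k.
    def read_x(s):
        _ = s["x"]           # Read x (value observed, then discarded)
    def set_x(s):
        s["x"] = s["y"]      # Set x = y
    def set_y(s):
        s["y"] = s["x"] + k  # Set y = x + k
    return [read_x, set_x, set_y]

def run(schedule):
    s = {"x": 0, "y": 10}    # shared variables (initial values assumed)
    for step in schedule:
        step(s)
    return (s["x"], s["y"])

outcomes = {run(sched) for sched in interleavings(make_thread(1), make_thread(2))}
print(sorted(outcomes))
```

Running this prints several distinct final states for (x, y): the program's result depends on the schedule, which is exactly why unsynchronized concurrent threads are unpredictable.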
CPU – Central Processing Unit
Ld %r1, 1
Ld %r2, mem
St [%r2], %r1
CPU – Central Processing Unit
Basic five-stage pipeline in a RISC machine (IF = Instruction Fetch, ID = Instruction Decode, EX = Execute, MEM = Memory access, WB = Register write back). In the fourth clock cycle (the green column), the earliest instruction is in MEM stage, and the latest instruction has not yet entered the pipeline.
Instruction pipeline: An instruction pipeline is a technique used in the design of computers and other digital electronic devices to increase their instruction throughput (the number of instructions that can be executed in a unit of time).
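The throughput gain is easy to model. Assuming an ideal five-stage pipeline with no stalls or hazards (our simplifying assumption), n instructions take n + 4 cycles instead of 5n; a Python sketch:

```python
STAGES = 5  # IF, ID, EX, MEM, WB

def unpipelined_cycles(n):
    # Without pipelining, each instruction uses the whole datapath
    # for all five stages before the next one starts.
    return STAGES * n

def pipelined_cycles(n):
    # Ideal pipeline, no stalls: after the first instruction fills
    # the pipeline, one instruction completes per cycle.
    return 0 if n == 0 else STAGES + (n - 1)

for n in (1, 5, 100):
    print(n, unpipelined_cycles(n), pipelined_cycles(n))
```

For 100 instructions this gives 500 cycles unpipelined versus 104 pipelined, approaching a 5x throughput improvement as n grows.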
Unfortunately, not all instructions are independent!
f = 1; e = 2; a = b = c = d = 3
s1. e = a + b
s2. f = c + d
s3. g = e * f
Instruction level parallelism
e = a + b
f = c + d
g = e * f
Result: g = 36
f = 1; e = 2; a = b = c = d = 3
s1. e = a + b
s2. f = c + d
s3. g = e * f
Instruction level parallelism
e = a + b
f = c + d
g = e * f
Result: g = 6
f = 1; e = 2; a = b = c = d = 3
s1. e = a + b
s2. f = c + d
s3. g = e * f
Instruction level parallelism
e = a + b    f = c + d (s1 and s2 execute in parallel)
g = e * f
“s3 is flow dependent on s1.” There are other types of dependencies.
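The dependence can be checked concretely. A small Python sketch reproducing the two results from the previous slides: 36 when s1 and s2 complete before s3, and 6 when s3 reads the stale value of f.

```python
# Initial values from the slides.
a = b = c = d = 3
e, f = 2, 1

# Respecting the flow dependences: s1 and s2 complete before s3 runs.
e = a + b              # s1
f_new = c + d          # s2
g_ordered = e * f_new  # s3 reads the updated e and f
print(g_ordered)       # 36

# If s3 were issued before s2 finished, it could read the stale f.
g_hazard = e * f       # s3 reads the old f = 1
print(g_hazard)        # 6
```

This is why hardware that exploits instruction-level parallelism must detect dependencies and stall or reorder only when it is safe to do so.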
Thread-level parallelism = task parallelism. Example: recalculation of a spreadsheet.
Where else to find parallelism?
Process-level parallelism. Example: two independent programs
(Freecell and email)
Granularity is the size of the unit of parallel work (e.g., instruction vs. thread vs. process)
Where else to find parallelism?
What is speed up?
Speedup on p processors: Sp = T1 / Tp, where p = number of processors, T1 = serial running time, and Tp = running time on p processors.
Speed up
Speed up - Example
[Diagram: serial execution Ta → Tb → Tc → Td → Te, vs. Ta followed by Tb, Tc, Td, Te in parallel]
Serial computation: Ta … Te
What is the time of computation if Tb, Tc, Td, Te are tasks that can be run in parallel on four processors?
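Under the common assumption that each task takes one unit of time (an assumption we add for illustration), the example works out as follows in a Python sketch:

```python
def speedup(t_serial, t_parallel):
    # S_p = T_1 / T_p
    return t_serial / t_parallel

# Five unit-time tasks. Serially: Ta, Tb, Tc, Td, Te one after another.
t_serial = 5
# In parallel: Ta first, then Tb..Te simultaneously on four processors.
t_parallel = 1 + 1
print(speedup(t_serial, t_parallel))  # 2.5
```

Even with four extra processors the speedup is only 2.5, because Ta must still run alone before the parallel portion can begin.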
Amdahl’s law
Maximum speed up
[Diagram: Ta must execute serially; Tb, Tc, Td, Te then execute in parallel]
Let f = fraction of time that must be serially executed. Then the maximum speed up is Sp = 1 / (f + (1 − f) / p).
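Amdahl's law, Sp = 1 / (f + (1 − f)/p), is easy to tabulate. A Python sketch showing how the speedup saturates near 1/f as the processor count grows:

```python
def amdahl_speedup(f, p):
    # Amdahl's law: f = serial fraction, p = number of processors.
    return 1.0 / (f + (1.0 - f) / p)

# With f = 0.1, the speedup is capped near 1 / f = 10,
# no matter how many processors are added.
for p in (1, 10, 100, 2000):
    print(p, round(amdahl_speedup(0.1, p), 2))
```

Only when f = 0 does the speedup equal p exactly; any nonzero serial fraction bounds the achievable speedup.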
Amdahl’s law
Maximum speed up
[Plot: speed up Sp vs. number of processors p, for serial fractions f = 0.1, 0.05, 0.01, 0.001, and 0]
Question: If a problem is not parallelizable by even only a small fraction, throwing more processors at a problem will not help speed it up. So, why try for a parallel solution?
Paradox
Gustafson?
(Gustafson’s law is equivalent to Amdahl’s law…)
Question: If a problem is not parallelizable by even only a small fraction, throwing more processors at a problem will not help speed it up. So, why try for a parallel solution?
Answer: A prerequisite to applying Amdahl’s or Gustafson’s formulation is that the serial and parallel programs take the same number of total calculation steps for the same input.
Maximum speed up
Use a resource-constrained serial execution as the base for the speedup calculation; and
Use a parallel implementation that can bypass a large number of calculation steps while yielding the same output as the corresponding serial algorithm.
This works for any algorithm in which the complexity of verification is lower than the complexity of the solution [Shi 1995] => most algorithms!
Breaking Amdahl’s Law
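As a toy illustration of the second point, consider searching for a known value: a parallel search can stop all workers as soon as any one of them succeeds, performing fewer total comparisons than the serial scan. This Python sketch is an idealized simulation; the lockstep scheduling and step counting are our assumptions, not from the slides.

```python
def serial_search_steps(data, target):
    # Serial linear scan: one comparison per element until found.
    for steps, v in enumerate(data, start=1):
        if v == target:
            return steps
    return len(data)

def parallel_search_steps(data, target, p):
    # Idealized simulation: p workers scan disjoint chunks in lockstep,
    # and ALL stop as soon as any worker finds the target.
    chunk = (len(data) + p - 1) // p
    chunks = [data[i * chunk:(i + 1) * chunk] for i in range(p)]
    for step in range(chunk):
        for c in chunks:
            if step < len(c) and c[step] == target:
                # Total work = p workers * (step + 1) lockstep comparisons.
                return p * (step + 1)
    return len(data)

data = list(range(1000))
target = 900  # sits at the very start of the last chunk when p = 10
print(serial_search_steps(data, target))        # 901
print(parallel_search_steps(data, target, 10))  # 10
```

Here the parallel version does less total work, not just the same work spread over more processors, so Amdahl's fixed-work premise does not apply.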
CPU = “Central Processing Unit” GPU = “Graphics Processing Unit” What’s the difference?
CPU vs GPU
Why do we classify hardware? In order to program a parallel computer, you have to understand the hardware very well.
The basic classification is Flynn’s taxonomy (1966): SISD, SIMD, MIMD, MISD.
Hardware Classification
Single Instruction Single Data
SISD
Examples: MOS Technology 6502, Motorola 68000, Intel 8086
Single Instruction Multiple Data
SIMD
Examples: ILLIAC IV; CM-1, CM-2; Intel Core, Atom; NVIDIA GPUs
Multiple Instruction Single Data
MISD
Example: Space Shuttle flight computers
Multiple Instruction Multiple Data
MIMD
Examples: BBN Butterfly, Cedar, CM-5, IBM RP3, Intel Cube, nCUBE, NYU Ultracomputer
Parallel Random Access Machine (PRAM). Idealized SIMD parallel computing model.
PRAM
Unlimited number of RAMs, called Processing Units (PUs).
The RAMs operate with the same instructions and synchronously.
Shared Memory is unlimited and accessed in one unit of time.
Shared Memory access is one of CREW, CRCW, EREW.
Communication between RAMs is only through Shared Memory.
PRAM is used for specifying an algorithm and analyzing the complexity of it.
PRAM-based algorithms can be adapted to SIMD architectures.
PRAM algorithms can be converted into CUDA implementations relatively easily.
Why is PRAM important?
Parallel for loop:
for Pi, 1 ≤ i ≤ n in parallel do
  …
end
PRAM pseudo code
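The parallel for loop can be simulated sequentially whenever each processor Pi touches only its own memory cells (the EREW case). A Python sketch; the doubling example and helper names are ours, for illustration only.

```python
def pram_parallel_for(n, body, shared):
    # Simulate "for Pi, 1 <= i <= n in parallel do body end".
    # On a PRAM all n processors run body in the same time step; a
    # sequential loop produces the same result when each Pi reads
    # and writes only its own cells (EREW: exclusive read,
    # exclusive write), because no step observes another's effects.
    for i in range(1, n + 1):
        body(i, shared)

def double_cell(i, shared):
    # Processor Pi doubles its own element of the shared array.
    shared[i - 1] = 2 * shared[i - 1]

A = [1, 2, 3, 4]
pram_parallel_for(len(A), double_cell, A)
print(A)  # [2, 4, 6, 8]
```

This disjoint-access pattern is also how such a loop maps naturally onto a CUDA kernel, with each thread playing the role of one Pi.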