“Matrix Multiply ― in parallel”
DESCRIPTION

“Matrix Multiply ― in parallel”. Joe Hummel, PhD, U. of Illinois, Chicago / Loyola University Chicago, [email protected]. Background: Class: “Introduction to CS for Engineers”; Lang: C/C++; Focus: programming basics, vectors, matrices; Timing: present this after introducing 2D arrays.

TRANSCRIPT
“Matrix Multiply ― in parallel”
Joe Hummel, PhD
U. of Illinois, Chicago / Loyola University Chicago
Background…
Class: “Introduction to CS for Engineers”
Lang: C/C++
Focus: programming basics, vectors, matrices
Timing: present this after introducing 2D arrays…
Matrix multiply
Yes, it’s boring, but…
◦ everyone understands the problem
◦ good example of triply-nested loops
◦ non-trivial computation

```cpp
for (int i = 0; i < N; i++)
  for (int j = 0; j < N; j++)
    for (int k = 0; k < N; k++)
      C[i][j] += (A[i][k] * B[k][j]);
```

1500x1500 matrix: 2.25M elements » 32 seconds…
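The triple loop above can be wrapped in a small self-contained sketch. The flat row-major `std::vector` storage and the function name `matmul_ijk` are assumptions for illustration; the slides use plain 2D arrays:

```cpp
#include <vector>

// Classic i-j-k matrix multiply: C = A * B for square N x N matrices.
// Matrices are stored as flat row-major vectors (an assumption made
// here for a self-contained sketch; the slides use 2D arrays).
std::vector<double> matmul_ijk(const std::vector<double>& A,
                               const std::vector<double>& B, int N) {
    std::vector<double> C(N * N, 0.0);
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            for (int k = 0; k < N; k++)
                C[i * N + j] += A[i * N + k] * B[k * N + j];
    return C;
}
```

For example, multiplying `[[1,2],[3,4]]` by `[[5,6],[7,8]]` yields `[[19,22],[43,50]]`.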
Multicore
Matrix multiply is a great candidate for multicore:
◦ embarrassingly parallel
◦ easy to parallelize via the outermost loop

```cpp
#pragma omp parallel for
for (int i = 0; i < N; i++)
  for (int j = 0; j < N; j++)
    for (int k = 0; k < N; k++)
      C[i][j] += (A[i][k] * B[k][j]);
```

1500x1500 matrix: quad-core CPU » 8 seconds…
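The OpenMP version drops into the same sketch with one pragma. Parallelizing the outermost loop gives each thread its own block of rows of C, so no two threads ever write the same element and no locking is needed. Compile with `-fopenmp` (GCC/Clang); without it the pragma is ignored and the code runs serially. Storage layout and the name `matmul_omp` are assumptions:

```cpp
#include <vector>

// i-j-k multiply with the outermost loop parallelized via OpenMP.
// Each thread handles a disjoint range of rows i, so writes to C
// never overlap -- no synchronization required.
std::vector<double> matmul_omp(const std::vector<double>& A,
                               const std::vector<double>& B, int N) {
    std::vector<double> C(N * N, 0.0);
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            for (int k = 0; k < N; k++)
                C[i * N + j] += A[i * N + k] * B[k * N + j];
    return C;
}
```

The result is bit-identical to the serial version: each C[i][j] is still accumulated by exactly one thread, in the same order.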
Designing for HPC
Parallelism alone is not enough…

HPC == Parallelism + Memory Hierarchy − Contention

Expose parallelism
Maximize data locality: network, disk, RAM, cache, core
Minimize interaction: false sharing, locking, synchronization
Data locality
What’s the other half of the chip? Cache!

Implications?
◦ No one implements MM this way
◦ Rewrite to use loop interchange, and access B row-wise…

```cpp
#pragma omp parallel for
for (int i = 0; i < N; i++)
  for (int k = 0; k < N; k++)
    for (int j = 0; j < N; j++)
      C[i][j] += (A[i][k] * B[k][j]);
```

1500x1500 matrix: quad-core + cache » 2 seconds…
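The interchanged i-k-j version can be sketched the same way. With the j loop innermost, both C[i][j] and B[k][j] are scanned left to right along a row, so every access is sequential in row-major memory and cache lines are fully used. Hoisting A[i][k] into a local is an extra touch not on the slide; layout and the name `matmul_ikj` are assumptions:

```cpp
#include <vector>

// Loop-interchanged (i-k-j) multiply: the innermost loop walks row i
// of C and row k of B sequentially, giving unit-stride access to both.
std::vector<double> matmul_ikj(const std::vector<double>& A,
                               const std::vector<double>& B, int N) {
    std::vector<double> C(N * N, 0.0);
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        for (int k = 0; k < N; k++) {
            double a = A[i * N + k];   // invariant across the j loop
            for (int j = 0; j < N; j++)
                C[i * N + j] += a * B[k * N + j];
        }
    return C;
}
```

The interchange only reorders independent additions into each C[i][j], so the result matches the original i-j-k loop order.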