
Reducing communication in dense matrix/tensor computations

Edgar Solomonik

UC Berkeley

Aug 11th, 2011


Outline

Topology-aware collectives
  Rectangular collectives
    Multicasts
    Reductions

2.5D algorithms
  2.5D matrix multiplication
  2.5D LU factorization

Tensor contractions
  Algorithms for distributed tensor contractions
  A tensor contraction library implementation

Conclusions and future work


Performance of multicast (BG/P vs Cray)

[Figure: bandwidth (MB/sec) vs. #nodes for a 1 MB multicast on BG/P, Cray XT5, and Cray XE6]


Why the performance discrepancy in multicasts?

- Cray machines use binomial multicasts
  - Form a spanning tree from a list of nodes
  - Route copies of the message down each branch
  - Network contention degrades utilization on a 3D torus
- BG/P uses rectangular multicasts
  - Requires the network topology to be a k-ary n-cube
  - Forms 2n edge-disjoint spanning trees
  - Routes in a different dimensional order down each tree
  - Uses both directions of the bidirectional network


2D rectangular multicast trees

[Figure: on a 2D 4x4 torus, four edge-disjoint spanning trees rooted at the same node, and all four trees combined]


A model for rectangular multicasts

t_mcast = m/B_n + 2(d+1)·o + 3L + d·P^(1/d)·(2o + L)

The multicast model consists of three terms:

1. m/B_n, the bandwidth cost incurred at the root
2. 2(d+1)·o + 3L, the start-up overhead of setting up the multicasts in all dimensions
3. d·P^(1/d)·(2o + L), the path overhead, which reflects the time for a packet to get from the root to the farthest destination node


A model for binomial multicasts

t_bnm = log_2(P) · (m/B_n + 2o + L)

- The root of the binomial tree sends the entire message log_2(P) times
- The setup overhead is overlapped with the path overhead
- We assume no contention
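As a concreteness check, here is a minimal sketch (my own, not from the talk) that evaluates both cost models above; the machine constants are illustrative placeholders, not measured values:

```python
from math import log2

def t_rect(m, P, d, B_n, o, L):
    """Rectangular multicast: t_mcast = m/B_n + 2(d+1)o + 3L + d*P^(1/d)*(2o+L)."""
    return m / B_n + 2 * (d + 1) * o + 3 * L + d * P ** (1.0 / d) * (2 * o + L)

def t_bnm(m, P, B_n, o, L):
    """Binomial multicast: t_bnm = log2(P) * (m/B_n + 2o + L)."""
    return log2(P) * (m / B_n + 2 * o + L)

# 1 MB multicast on a 3D torus of 512 nodes (B_n, o, L are assumed values)
m, P, d = 1 << 20, 512, 3
B_n, o, L = 425e6, 2e-6, 1e-6  # bytes/sec, sec, sec -- placeholders
print("rectangular:", t_rect(m, P, d, B_n, o, L))  # pays m/B_n once at the root
print("binomial:   ", t_bnm(m, P, B_n, o, L))      # pays m/B_n log2(P) times
```

The key contrast: the rectangular tree pays the m/B_n bandwidth term once, while the binomial tree pays it log_2(P) times, which is the gap the surrounding plots illustrate.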


Model verification: one dimension

[Figure: bandwidth (MB/sec) vs. message size (KB) for DCMF Broadcast on a ring of 8 nodes of BG/P; series: t_rect model, DCMF rectangle dput, t_bnm model, DCMF binomial]


Model verification: two dimensions

[Figure: bandwidth (MB/sec) vs. message size (KB) for DCMF Broadcast on 64 (8x8) nodes of BG/P; series: t_rect model, DCMF rectangle dput, t_bnm model, DCMF binomial]


Model verification: three dimensions

[Figure: bandwidth (MB/sec) vs. message size (KB) for DCMF Broadcast on 512 (8x8x8) nodes of BG/P; series: t_rect model, Faraj et al. data, DCMF rectangle dput, t_bnm model, DCMF binomial]


A model for rectangular reductions

t_red = max[m/(8γ), 3m/β, m/B_n] + 2(d+1)·o + 3L + d·P^(1/d)·(2o + L)

- Any multicast tree can be inverted to produce a reduction tree
- The reduction operator must be applied at each node
  - Each node operates on 2m data
  - Both the memory bandwidth and computation cost can be overlapped
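The same kind of sketch for the reduction model (again my illustration; reading γ as the node's flop rate and β as its memory bandwidth, which the formula implies but the slide does not state):

```python
def t_red(m, P, d, B_n, o, L, gamma, beta):
    """Rectangular reduction model: the max term takes whichever of
    computation, memory traffic, or network injection dominates,
    since the three overlap at each node."""
    return (max(m / (8 * gamma), 3 * m / beta, m / B_n)
            + 2 * (d + 1) * o + 3 * L
            + d * P ** (1.0 / d) * (2 * o + L))
```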


Rectangular reduction performance on BG/P

[Figure: DCMF Reduce peak bandwidth (MB/sec, largest message size) vs. #nodes; series: torus rectangle ring, torus binomial, torus rectangle, torus short binomial]

BG/P rectangular reduction performs significantly worse than multicast.


Performance of custom line reduction

[Figure: bandwidth (MB/sec) vs. message size (KB) for custom Reduce/Multicast on 8 nodes; series: MPI Broadcast, Custom Ring Multicast, Custom Ring Reduce 2, Custom Ring Reduce 1, MPI Reduce]


Another look at that first plot

Just how much better are rectangular algorithms on P = 4096 nodes?

- Binomial collectives on XE6: 1/30th of link bandwidth
- Rectangular collectives on BG/P: 4.3X the link bandwidth
- Over 120X improvement in efficiency!

How can we apply this?

[Figure: bandwidth (MB/sec) vs. #nodes for a 1 MB multicast on BG/P, Cray XT5, and Cray XE6 (same plot as before)]


2.5D Cannon-style matrix multiplication

[Figure: shift patterns and replication layers for 2D (P=16, c=1), 2.5D (P=32, c=2), and 3D (P=64, c=4) Cannon-style matrix multiplication, showing blocks of A (e.g., A₂₀...A₂₃) and B (e.g., B₀₁...B₃₁) moving across processor layers]


Classification of parallel dense matrix algorithms

algs   c             memory (M)       words (W)       messages (S)
2D     1             O(n^2/P)         O(n^2/√P)       O(√P)
2.5D   [1, P^(1/3)]  O(c·n^2/P)       O(n^2/√(cP))    O(√(P/c^3))
3D     P^(1/3)       O(n^2/P^(2/3))   O(n^2/P^(2/3))  O(log(P))

NEW: 2.5D algorithms generalize 2D and 3D algorithms.
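A quick sanity check of the table (my sketch; the asymptotic bounds are treated as exact for illustration): replicating the matrices c times cuts the words moved by a factor of √c.

```python
from math import sqrt

def words_2d(n, P):     return n * n / sqrt(P)           # O(n^2/sqrt(P))
def words_25d(n, P, c): return n * n / sqrt(c * P)       # O(n^2/sqrt(cP))
def words_3d(n, P):     return n * n / P ** (2.0 / 3.0)  # O(n^2/P^(2/3))

n, P = 65536, 2048
for c in (1, 4, 16):
    # the 2D-to-2.5D bandwidth ratio is sqrt(c)
    print(c, round(words_2d(n, P) / words_25d(n, P, c), 2))
```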


Minimize communication with

- minimal memory (2D)
- as much memory as available (2.5D) - flexible
- as much memory as the algorithm can exploit (3D)

Match the network topology of

- a √P-by-√P grid (2D)
- a √(P/c)-by-√(P/c)-by-c grid, most cuboids (2.5D) - flexible
- a P^(1/3)-by-P^(1/3)-by-P^(1/3) cube (3D)


2.5D SUMMA-style matrix multiplication

[Figure: matrix mapping to a 3D partition of BG/P]


2.5D MM strong scaling

[Figure: percentage of machine peak vs. #nodes for 2.5D MM on BG/P (n=65,536); series: 2.5D SUMMA, 2.5D Cannon, 2D MM (Cannon), ScaLAPACK PDGEMM]


2.5D MM on 65,536 cores

[Figure: percentage of machine peak vs. matrix dimension n for 2.5D MM on 16,384 nodes of BG/P; series: 2D SUMMA, 2D Cannon, 2.5D Cannon, 2.5D SUMMA]


Cost breakdown of MM on 65,536 cores

[Figure: execution time normalized by 2D, broken down into computation, idle, and communication, for SUMMA (2D vs 2.5D) on 16,384 nodes of BG/P at n=8192, n=32768, and n=131072]


A new latency lower bound for LU

Can we reduce latency to O(√(P/c^3)) for LU?

- For block size n/d, LU does
  - Ω(n^3/d^2) flops
  - Ω(n^2/d) words
  - Ω(d) messages
- Now pick d (= latency cost):
  - d = Ω(√P) to minimize flops
  - d = Ω(√(cP)) to minimize words

No dice: either choice forces at least Ω(√P) messages. But let's minimize bandwidth.

[Figure: an n-by-n matrix A with diagonal blocks A₀₀, A₁₁, ..., A_{d-1,d-1}; the critical path runs through the d diagonal block factorizations k₀, k₁, ..., k_{d-1}]
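Spelled out, the tradeoff on this slide (a sketch of the argument as stated, using the 2.5D costs from the earlier table):

```latex
% With block size n/d, the LU critical path costs
% F = Omega(n^3/d^2) flops, W = Omega(n^2/d) words, S = Omega(d) messages.
\begin{align*}
F = \Omega(n^3/d^2) \le O(n^3/P) &\;\Rightarrow\; d = \Omega(\sqrt{P})\\
W = \Omega(n^2/d) \le O(n^2/\sqrt{cP}) &\;\Rightarrow\; d = \Omega(\sqrt{cP})\\
S = \Omega(d) = \Omega(\sqrt{cP}) &\;\gg\; O(\sqrt{P/c^3}) \quad \text{for } c > 1
\end{align*}
```

So a bandwidth-optimal 2.5D LU must send Ω(√(cP)) messages; the O(√(P/c^3)) message count of 2.5D MM is out of reach.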


2.5D LU factorization without pivoting

1. Factorize A₀₀ and communicate L₀₀ and U₀₀ among layers.
2. Perform TRSMs to compute a panel of L and a panel of U.
3. Broadcast blocks so all layers own the panels of L and U.
4. Broadcast different subpanels within each layer.
5. Multiply subpanels on each layer.
6. Reduce (sum) the next panels.*
7. Broadcast the panels and continue factorizing the Schur complement...

* All layers always need to contribute to the reduction, even if the iteration is done with a subset of layers.

[Figure: stages (A)-(D) of the algorithm, showing L₀₀ and U₀₀ replicated across layers and the L panels (L₁₀, L₂₀, L₃₀) and U panels (U₀₁, U₀₃) broadcast within layers]


2.5D LU factorization with tournament pivoting

1. Factorize each block in the first column with pivoting.
2. Reduce to find the best pivot rows.
3. Pivot rows in the first big block column on each layer.
4. Apply TRSMs to compute the first column of L and the first block of a row of U.
5. Update the corresponding interior blocks S = A - L·U₀₁.
6. Recurse to compute the rest of the first big block column of L.
7. Pivot rows in the rest of the matrix on each layer.
8. Perform TRSMs to compute a panel of U.
9. Update the rest of the matrix as before and recurse on the next block panel...

[Figure: block-level diagram of the above steps across layers, showing the PA₀...PA₃ factorizations, the pivot-row reduction, the TRSMs producing L₁₀...L₄₀ and U₀₁...U₀₃, and the Schur complement updates]


2.5D LU strong scaling

[Figure: percentage of machine peak vs. #nodes for 2.5D LU on BG/P (n=65,536); series: 2.5D LU (no-pvt), 2.5D LU (CA-pvt), 2D LU (no-pvt), 2D LU (CA-pvt), ScaLAPACK PDGETRF]


2.5D LU on 65,536 cores

[Figure: time (sec), broken into compute, idle, and communication, for 2.5D LU vs 2D LU on 16,384 nodes of BG/P (n=131,072); configurations: 2D no-pvt, 2.5D no-pvt bnm, 2.5D no-pvt rect, 2D CA-pvt, 2.5D CA-pvt]


Bridging dense linear algebra techniques and applications

Target application: tensor contractions in electronic structure calculations (quantum chemistry)

- Often memory constrained
- Most target tensors are oddly shaped
- Need support for high-dimensional tensors
- Need handling of partial/full tensor symmetries
- Would like to use communication-avoiding ideas (blocking, 2.5D, topology-awareness)


Decoupling memory usage and topology-awareness

- 2.5D algorithms couple memory usage and virtual topology
  - c copies of a matrix implies c processor layers
- Instead, we can nest 2D and/or 2.5D algorithms
- Higher-dimensional algorithms allow smarter topology-aware mapping


4D SUMMA-Cannon

How do we map to a 3D partition without using more memory?

- SUMMA (bcast-based) on 2D layers
- Cannon (send-based) along the third dimension
- Cannon calls SUMMA as a sub-routine
- Minimizes inefficient (non-rectangular) communication
- Allows better overlap
- Treats MM as a 4D tensor contraction

[Figure: percentage of flops peak vs. matrix dimension for MM on 512 nodes of BG/P; series: 2.5D MM, 4D MM, 2D MM (Cannon), PBLAS MM]


Symmetry is a problem

- A fully symmetric tensor of dimension d requires only ~n^d/d! storage
- Symmetry significantly complicates the sequential implementation
  - Irregular indexing makes alignment and unrolling difficult
  - Generalizing over all partial symmetries is expensive
- Blocked or block-cyclic virtual processor decompositions give irregular or imbalanced virtual grids

[Figure: blocked vs. block-cyclic decompositions of a symmetric matrix over processors P0-P3, both yielding irregular or imbalanced local blocks]


Solving the symmetry problem

- A cyclic decomposition allows balanced and regular blocking of symmetric tensors
- If the cyclic phase is the same in each symmetric dimension, each sub-tensor retains the symmetry of the whole tensor (see the sketch below)

[Figure: cyclic decomposition of a symmetric matrix; each processor's local block is itself regular and symmetric]
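A minimal sketch (my illustration, not library code) of the claim: with the same cyclic phase in both dimensions of a symmetric matrix, a diagonal processor's local block is itself symmetric (and off-diagonal blocks come in transposed pairs).

```python
n, phase = 8, 2  # global dimension, cyclic phase (a 2x2 processor grid)

def local_block(T, p0, p1, phase):
    """Rows with i = p0 (mod phase) and columns with j = p1 (mod phase)."""
    return [[T[i][j] for j in range(p1, len(T), phase)]
            for i in range(p0, len(T), phase)]

T = [[min(i, j) for j in range(n)] for i in range(n)]  # symmetric: T[i][j] == T[j][i]
B = local_block(T, 1, 1, phase)                        # a diagonal processor's block
assert all(B[a][b] == B[b][a] for a in range(len(B)) for b in range(len(B)))
```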


A generalized cyclic layout is still challenging

- In order to retain partial symmetry, all symmetric dimensions of a tensor must be mapped with the same cyclic phase
- The contracted dimensions of A and B must be mapped with the same phase
- And yet the virtual mapping needs to be mapped to a physical topology, which can be any shape


Virtual processor grid dimensions

- Our virtual cyclic topology is somewhat restrictive, and the physical topology is very restricted
- Virtual processor grid dimensions serve as a new level of indirection
- If a tensor dimension must have a certain cyclic phase, adjust the physical mapping by creating a virtual processor dimension
- This allows the physical processor grid to be 'stretchable' (sketched below)
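One way to read the stretching idea (my illustration; the rounding rule here is an assumption, not taken from the slides): if a tensor dimension needs cyclic phase `required` but the matching physical torus dimension has length `phys`, attach a virtual dimension so the combined phase is a multiple of the physical length.

```python
from math import ceil

def virtualize(phys, required):
    """Smallest virtual-dimension length making phys*virt >= required."""
    virt = ceil(required / phys)
    return virt, phys * virt  # (virtual length, effective cyclic phase)

print(virtualize(4, 6))  # (2, 8): a length-4 torus dimension acts as 8 ranks
```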


Constructing a virtual processor grid for MM

[Figure: matrix multiplication C = A x B on a 2x3 processor grid; red lines represent the virtualized part of the processor grid, and elements are assigned to blocks by cyclic phase]


Unfolding the processor grid

- Higher-dimensional fully symmetric tensors can be mapped onto a lower-dimensional processor grid via creation of new virtual dimensions
- Lower-dimensional tensors can be mapped onto a higher-dimensional processor grid by unfolding (serializing) pairs of processor dimensions
- However, when possible, replication is better than unfolding, since unfolded processor grids can lead to an unbalanced mapping


A basic parallel algorithm for symmetric tensor contractions

1. Arrange the processor grid in any k-ary n-cube shape
2. Map (via unfolding and virtualization) both A and B cyclically along the dimensions being contracted
3. Map (via unfolding and virtualization) the remaining dimensions of A and B cyclically
4. For each tensor dimension contracted over, recursively multiply the tensors along the mapping; each contracted dimension is represented by a nested call to a local multiply or to a parallel algorithm (e.g. Cannon), as sketched below
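A simplified, sequentially simulated sketch of the recursion in step 4 (my illustration: the 'parallel algorithm' at each level is stood in for by a plain loop over blocks of the contracted dimension):

```python
import numpy as np

def contract(phases, A, B):
    """Contract A @ B, one recursion level per mapped contraction dimension;
    `phases` lists the block count handled at each level."""
    if not phases:
        return A @ B  # base case: the local sequential contraction
    p, rest = phases[0], phases[1:]
    k = A.shape[1] // p
    # one 'parallel step' per block of the contracted dimension (Cannon-like)
    return sum(contract(rest, A[:, s*k:(s+1)*k], B[s*k:(s+1)*k, :])
               for s in range(p))

A, B = np.ones((8, 8)), np.ones((8, 8))
assert np.allclose(contract([2, 2], A, B), A @ B)  # two nested recursion levels
```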


Tensor library structure

The library supports arbitrary-dimensional parallel tensor contractions, with any symmetries, on n-cuboid processor torus partitions.

1. Load tensor data by (global rank, value) pairs
2. Once a contraction is defined, map the participating tensors
3. Distribute or reshuffle tensor data/pairs
4. Construct the contraction algorithm with recursive function/argument pointers
5. Contract the sub-tensors with a user-defined sequential contract function
6. Output (global rank, value) pairs on request
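A small sketch of the pair convention in steps 1 and 6, under my assumption that 'global rank' means the linearized global index of a tensor element (the slides do not fix the ordering; first-dimension-fastest is assumed here):

```python
def global_rank(idx, dims):
    """Linearize a multidimensional index, first dimension varying fastest."""
    r, stride = 0, 1
    for i, d in zip(idx, dims):
        r += i * stride
        stride *= d
    return r

dims = (64, 64, 256, 256)
pairs = [(global_rank((1, 2, 3, 4), dims), 0.5)]  # data enters/leaves as such pairs
print(pairs)  # [(4206721, 0.5)]
```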


Current tensor library status

- Dense and symmetric remapping/repadding/contractions implemented
- Currently functional only for dense tensors, but with full symmetric logic
- Can perform automatic mapping with physical and virtual dimensions, but cannot unfold processor dimensions yet
- Complete library interface implemented, including basic auxiliary functions (e.g. map/reduce, sum, etc.)


Next implementation steps

- Currently integrating the library with an SCF method code that uses dense contractions
- Get symmetric redistribution working correctly
- Automatic unfolding of processor dimensions
- Implement mapping by replication to enable 2.5D algorithms
- Much basic performance debugging/optimization left to do
- More optimization needed for sequential symmetric contractions


Very preliminary contraction library results

Contracts tensors of size 64x64x256x256 in 1 second on 2K nodes

[Figure: percentage of machine peak vs. #nodes, strong scaling of a dense 64x64x256x256 contraction on BG/P; series: no rephase, rephase every contraction, repad every contraction]


Potential benefit of unfolding

Unfolding the smallest two BG/P torus dimensions improves performance.

[Figure: percentage of machine peak vs. #nodes, strong scaling of a dense 64x64x256x256 contraction on BG/P; series: no-rephase 2D, no-rephase 3D]


Contributions

- Models for rectangular collectives
- 2.5D algorithms: theory and implementation
- Using a cyclic mapping to parallelize symmetric tensor contractions
- Extending and tuning the processor grid with virtual dimensions
- Automatic mapping of high-dimensional tensors to topology-aware physical partitions
- A parallel tensor contraction algorithm/library without a global address space


Conclusions and references

- The parallel tensor contraction algorithm and library seem to be the first communication-efficient practical approach
- Preliminary results and theory indicate the high potential of this tensor contraction library
- Papers:
  - (2.5D) to appear in Euro-Par 2011, Distinguished Paper
  - (2.5D + rectangular collective models) to appear in Supercomputing 2011


Backup slides


A new LU latency lower bound

[Figure: same critical-path diagram as before — an n-by-n matrix A with diagonal blocks A₀₀, A₁₁, ..., A_{d-1,d-1} along the critical path]

- The flops lower bound requires d = Ω(√P) blocks/messages
- The bandwidth lower bound requires d = Ω(√(cP)) blocks/messages


Virtual topology of 2.5D algorithms

[Figure: virtual topology with axes i and j of size √(P/c) and axis k of size c]

- 2D algorithm mapping: a √P-by-√P grid
- 2.5D algorithm mapping: a √(P/c)-by-√(P/c)-by-c grid, for any c
