
Page 1

GSKS / GSKNN: BLIS-Based High Performance Computing Kernels in N-body Problems

Chenhan D. Yu

The 3rd BLIS Retreat, Sep 28, 2015

Copyright © 2015, The University of Texas at Austin

Page 2

N-body Problems

https://youtu.be/bLLWkx_MRfk (Hellstorm Astronomy and 3D)

Page 3

N-body Problems

N-body problems aim to describe the interaction (relation) of N points {X} in a d-dimensional space.

K(xi, xj) = Kij describes the interaction between xi and xj.

Three operations: Kernel Summation u = Kw, Kernel Inversion w = (K + λI)^-1 u, and Nearest-Neighbors.

2D and 3D applications can be found in computational physics, geophysical exploration and medical imaging.

High-dimensional applications in computational statistics include clustering, classification and regression.


Page 4

[Figure: 16 Nearest Neighbors on 10-core Ivy Bridge, 8192 points. Gflops vs. dimension (4, 20, 36, 68, 132, 260, 516, 1028) for MKL+STL and GSKNN, with an 8192x20x8192 DGEMM shown for reference and 50% / 80% efficiency marks.]

Page 5

Outline

Kernel Summation (u = Kw) and Nearest-Neighbors.

How is GEMM applied in the conventional approach?

Why can GEMM become memory bound in these operations?

What insight is required to design an algorithm that avoids redundant memory operations while still preserving efficiency?

How are GSKS and GSKNN inspired by the BLIS framework in their design?

Page 6

Linear Kernel: K(xi, xj) = xi^T xj

Worked example with x1 = (1,1), x2 = (0,0), x3 = (1,0):

  Q^T = [ 1 1 ]     R = [ 1 0 1 ]     K = Q^T R = [ 2 0 1 ]
        [ 0 0 ]         [ 1 0 0 ]                 [ 0 0 0 ]
        [ 1 0 ]                                   [ 1 0 1 ]

Kernel Summation: with w = (1, 1, 1)^T, Kw = (3, 0, 2)^T.

Nearest-Neighbors: for each point, the nearest neighbors are selected from the corresponding row of K (the slide lists candidate neighbor lists for x1, x2 and x3).
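As a minimal illustration of the slide's arithmetic (this is not GSKS/GSKNN code, and all names are local to the example), the plain-C sketch below builds K = Q^T R for the three points above and forms the kernel summation Kw with w = (1, 1, 1)^T.

#include <stdio.h>

/* Tiny plain-C version of the worked example: three 2-D points, linear
 * kernel.  Build K = Q^T R, then the kernel summation u = Kw. */
int main(void)
{
    const int n = 3, d = 2;
    double X[3][2] = { {1, 1}, {0, 0}, {1, 0} };   /* x1, x2, x3           */
    double K[3][3], w[3] = { 1, 1, 1 }, u[3];

    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j) {
            K[i][j] = 0.0;
            for (int p = 0; p < d; ++p)
                K[i][j] += X[i][p] * X[j][p];      /* K(xi, xj) = xi^T xj  */
        }

    for (int i = 0; i < n; ++i) {                  /* kernel summation Kw  */
        u[i] = 0.0;
        for (int j = 0; j < n; ++j)
            u[i] += K[i][j] * w[j];
    }

    printf("Kw = (%g, %g, %g)\n", u[0], u[1], u[2]);  /* prints (3, 0, 2)  */
    return 0;
}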

Page 7

Other Kernels: K(xi, xj) = f(xi^T xj), e.g. the Gaussian kernel.

[Paper excerpt, 39:2, C. D. Yu et al.]

... grows exponentially with k. For high-dimensional data analysis problems such as kernel regression and kernel classification, schemes are usually linear or superlinear in k for scalability. For example, ASKIT [] takes a nearest-neighbor pruning and a spatial tree to improve scalability.

Evaluating K at runtime instead of reusing it significantly increases the time complexity, especially when k is large. How to compute a dense K and its summation Kw efficiently is a new bottleneck for all applications that require the operation. For example, the floating-point efficiency of LIBSVM [Chang and Lin 2011] doesn't scale with k, since the kernel values are computed element-wise without taking advantage of the modern computer architecture. Without insight into the computer architecture, most software written in purely interpreted or partially compiled languages runs within 3% of CPU peak performance; even for a pure C/Fortran program, 10% is usually the average. One solution is to take advantage of high-performance level-3 BLAS (Basic Linear Algebra Subprograms) to compute a submatrix of K. K(xi, xj) is usually a function of the square distance ||xi - xj||_2^2, and its expansion (2) is mainly a pairwise inner product xi^T xj:

  ||xi - xj||_2^2 = ||xi||_2^2 + ||xj||_2^2 - 2 xi^T xj    (2)

With expression (2), computing K can rely on a high-performance matrix-matrix multiplication (GEMM), which can usually reach more than 80% of peak performance on modern CPU architectures when k is large. Level-3 BLAS routines such as GEMM are highly optimized by domain experts, and the interfaces are standardized to facilitate users. The kernel values can be accelerated by vectorized math functions (e.g. the Intel Vectorized Math Library), and Kw can be computed by GEMV (general matrix-vector multiplication), a level-2 BLAS subroutine. CUKNN [Liang et al. 2009] computes pairwise distances with (2) to accelerate the search for k nearest neighbors, and the same approach is taken in ASKIT [March et al. 2014; March et al. 2015], a treecode approach for fast high-dimensional kernel summation in both near-field and far-field evaluation.

Although the BLAS approach works reasonably well for large k, several drawbacks remain unsolved:

(1) The BLAS approach transforms the low-dimensional case of (1) from a compute-bound problem into a memory-bound one.

(2) GEMM is memory bound for small k; it needs a large enough k to reach high performance, yet most practical problems have k ≤ 32.

(3) To compute a subproblem of (1), coordinates need to be collected to form dense A and B in order to use GEMM, requiring extra memory space.

(4) The intermediate results -2 xi^T xj need to be stored as a dense matrix C, which requires extra memory space.

(5) The temporary spaces A, B and C bring extra memory accesses with them, suffering a serious penalty.

(6) GEMV is a memory-bound operation which can hardly reach peak floating-point performance.

To summarize the BLAS approach: the standardized BLAS interface limits the possibility of combining different operations. The drawbacks listed above motivate a new BLAS-like subroutine for kernel summation, combining GEMM, GEMV and vectorized math functions to exploit the modern CPU architecture.

[Paper excerpt, 39:4, C. D. Yu et al.]

ALGORITHM 1: Kernel Summation with GEMM, VEXP and GEMV
  X_A(:, α) → A_s,  ||X_A||_2^2(α) → ||A_s||_2^2,  u(α) → u_s;
  X_B(:, β) → B_s,  ||X_B||_2^2(β) → ||B_s||_2^2,  w(ω) → w_s;
  GEMM(A_s, -2.0, B_s, 0.0, C_s);
  for j = 0 : n-1 do
    for i = 0 : m-1 do
      C_s(i, j) += ||A_s||_2^2(i) + ||B_s||_2^2(j);
      C_s(i, j) *= -1/(2h^2);
    end
  end
  VEXP(C_s);
  GEMV(C_s, 1.0, w_s, 1.0, u_s);
  u_s → u(α);

... same map as B_s. For example, ASKIT creates skeleton weights w̃ for approximation; thus, here we use w_s = w(ω_s) to take care of this special situation. Given the notation above, (1) can be approximated by (3), and GSKS is designed to solve a dense kernel summation:

  u = Σ_s u_s = Σ_s K(α_s, β_s) w(ω_s)    (3)

Each element of K(α_s, β_s) is usually a function of the square distance ||xi - xj||_2^2, where xi ∈ A_s and xj ∈ B_s. To be more concise, we take the Gaussian kernel as an example for the rest of the article, and the kernel function is written as:

  K(xi, xj) = exp(-||xi - xj||_2^2 / (2h^2))    (4)

where h is the width of the kernel. The square distance can be evaluated directly or by (2). Precomputing ||xi||_2^2 = ||X_A||_2^2(i) and reusing the results of ||X_A||_2^2 and ||X_B||_2^2 requires many fewer FMA (fused multiply-add) operations than evaluating ||xi - xj||_2^2 directly. Moreover, -2 xi^T xj, (4) and (3) can be computed by GEMM, VEXP (vectorized exponential function) and GEMV, which provides an opportunity to achieve excellent performance on modern CPU architectures.

The GEMM, VEXP and GEMV combination approach is widely used in modern kernel methods to achieve high performance with the BLAS approach. We summarize the combination in Algorithm 1 using the notation just defined. The drawback of this approach is that A_s, B_s and C_s need to be formed explicitly in order to use GEMM, VEXP and GEMV, due to the standardized BLAS interface. A_s and B_s are created to collect points from X_A and X_B, since GEMM only takes contiguous or uniform-stride inputs. C_s must be created to hold the output of GEMM, and it is also required by VEXP and GEMV. These temporary spaces are redundant, and they bring extra memory accesses with them. Motivated by these redundant memory operations, we develop a BLAS-like subroutine which embeds Algorithm 1 into a micro-kernel, avoiding the redundant memory allocations and operations.

2.1. Sequential General Stride Kernel Summation

We first present the pseudo-code of GSKS with the Gaussian kernel in Algorithm 2 for computing a subproblem of (3). Besides GSKS, GEKS is the special case of GSKS standing for the general storage version, where α(i) = i, β(j) = j and ω(j) = j. The algorithm contains 6 layers of loops which correspond to different partitionings of m, n and k. The partitioning scheme is similar to the GEMM implementation in BLIS [Van Zee and van de Geijn 2015].
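As a concrete reference for Algorithm 1, here is a plain-C sketch of the conventional BLAS path. It assumes a CBLAS header is available; the routine name, the column-major layout (one point per column of A_s and B_s), and the scalar exp() loop standing in for VEXP are illustrative choices of this sketch, not the paper's code.

#include <math.h>
#include <cblas.h>

/* Sketch of Algorithm 1 (conventional BLAS approach).  As/Bs are d x m and
 * d x n coordinate blocks already gathered from X_A/X_B, a2/b2 hold the
 * precomputed squared norms, Cs is an m x n column-major workspace. */
void kernel_summation_blas(int m, int n, int d, double h,
                           const double *As, const double *Bs,
                           const double *a2, const double *b2,
                           const double *ws, double *us, double *Cs)
{
    /* Cs = -2 * As^T * Bs  (pairwise inner products) */
    cblas_dgemm(CblasColMajor, CblasTrans, CblasNoTrans, m, n, d,
                -2.0, As, d, Bs, d, 0.0, Cs, m);

    /* Add the squared norms and scale: Cs(i,j) = -||a_i - b_j||^2 / (2h^2) */
    for (int j = 0; j < n; ++j)
        for (int i = 0; i < m; ++i) {
            double v = Cs[j * m + i] + a2[i] + b2[j];
            Cs[j * m + i] = v * (-1.0 / (2.0 * h * h));
        }

    /* VEXP: elementwise exponential (a vectorized math library would do this) */
    for (int idx = 0; idx < m * n; ++idx)
        Cs[idx] = exp(Cs[idx]);

    /* GEMV: us += Cs * ws */
    cblas_dgemv(CblasColMajor, CblasNoTrans, m, n, 1.0, Cs, m, ws, 1, 1.0, us, 1);
}

Every temporary in this sketch (As, Bs, Cs) is exactly the redundant storage and memory traffic that drawbacks (3) through (5) point at.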

The expansion exposes GEMM operations:

  GEMM:  Q^T R =
    [ 1 1 ]                 [ 2 0 1 ]
    [ 0 0 ] x [ 1 0 1 ]  =  [ 0 0 0 ]
    [ 1 0 ]   [ 1 0 0 ]     [ 1 0 1 ]

  Precomputed square norms: x1^T x1 = 2, x2^T x2 = 0, x3^T x3 = 1,
  so that, e.g., ||x1 - x3||_2^2 = 1 + 2 - 2*1 = 1.

Page 8

The Big Picture

Kw takes O(N^2) if K is precomputed, otherwise O(dN^2). The cost is too expensive when N is large.

Exhaustive search requires O(N^2 log(k)) if K is precomputed, otherwise O(dN^2 + N^2 log(k)).

Divide-and-conquer approximation: Barnes-Hut or FMM for kernel summation, and randomized KD-trees or locality-sensitive hashing for kNN.

Still, the subproblem of all these algorithms is to solve several smaller dense kernel summations or kNN searches.

Solving the subproblem fast benefits all these methods.

Page 9

Subproblem

Take two subsets Q and R from X.

Compute K(Q, R) with GEMM using: ||xi - xj||_2^2 = ||xi||_2^2 + ||xj||_2^2 - 2 xi^T xj.

Compute Kw with GEMV, or select k entries in each row.

Rely on BLAS, VML (Vectorized Math Library) and STL.

What can possibly go wrong?



Page 10

Visualization

[Diagram: from the d x N point set X, take Q^T (m x d) and R (d x n); K = Q^T R is m x n. Kernel summation (KS) reduces each row to an m x 1 output, while nearest-neighbors (KNN) keeps an m x k neighbor list.]

Page 11

Insights

Q, R and K can't be stored.

Collect Q and R from X during packing.

K(xi, xj) = Kij must be computed in registers.

Kw or k-select must be completed in registers.

Only store the output.

i.e., we need a special packing routine, and we fuse GEMM with distance calculations, special function evaluations, and Kw or k-select (a simplified scalar sketch of this fusion follows below).
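The sketch below is a simplified scalar rendering of the fusion idea for the Gaussian kernel summation case, assuming already-packed Q, R and precomputed squared norms; it is not the real GSKS/GSKNN micro-kernel (which works on blocked, vectorized tiles), and the function and argument names are made up for the example.

#include <math.h>

/* Simplified scalar sketch of the fusion: the kernel entry K(i,j) lives only
 * in a register, its contribution to u is accumulated immediately, and K is
 * never stored. */
void fused_gauss_summation(int m, int n, int d, double h,
                           const double *Q,   /* m x d, row-major: packed queries    */
                           const double *R,   /* n x d, row-major: packed references */
                           const double *q2,  /* precomputed ||q_i||^2, length m     */
                           const double *r2,  /* precomputed ||r_j||^2, length n     */
                           const double *w,   /* weights, length n                   */
                           double *u)         /* output, length m                    */
{
    for (int i = 0; i < m; ++i) {
        double ui = 0.0;
        for (int j = 0; j < n; ++j) {
            double ip = 0.0;                            /* the "GEMM" part        */
            for (int p = 0; p < d; ++p)
                ip += Q[i * d + p] * R[j * d + p];
            double dist2 = q2[i] + r2[j] - 2.0 * ip;    /* expansion (2)          */
            double kij = exp(-dist2 / (2.0 * h * h));   /* Gaussian kernel (4)    */
            ui += kij * w[j];                           /* the "GEMV" part, fused */
        }
        u[i] += ui;
    }
}

For GSKNN the last line would instead push (dist2, j) into the per-query heap, i.e. the k-select replaces the weighted accumulation.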

Page 12

Code Fusion in BLIS

[Diagram: BLIS-style blocking of K = Q^T R over Q^T (m x d) and R (d x n); the 6th loop partitions n by nc, the 4th loop partitions m by mc, and the micro-kernel works on mr x nr tiles while the neighbor lists are updated inside the loop nest.]

Code fusion is done in the micro-kernel, and the BLIS framework is maintained.

Slice and Dice!

Page 13

GSKNN and BLIS (K = Q^T R)

[Diagram: the GSKNN blocking mirrors BLIS GEMM: Pack Rc (and Rc2), Pack Qc (and Qc2); Reuse Rc, Stream Qc through the Micro-Kernel. A simplified C sketch of this loop nest follows.]
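The sketch below shows the BLIS-style loop nest and packing that the slide refers to, stripped down to a scalar "micro-kernel" and tiny illustrative block sizes (the real GSKNN uses the tuned MC/NC/KC/MR/NR from the paper and an interleaved micro-panel packing layout that is omitted here). All names and sizes are assumptions of this sketch.

#include <stdio.h>

enum { M = 8, N = 8, D = 6 };                   /* tiny problem, divisible sizes  */
enum { MC = 4, NC = 4, DC = 3, MR = 2, NR = 2 };/* illustrative block sizes       */

static double Qc[MC * DC], Rc[DC * NC];         /* packed buffers (L2/L3 resident)*/

/* mr x nr micro-tile: in GSKNN/GSKS this is where the distance, the kernel
 * evaluation and Kw or k-select are fused; here it only accumulates Q^T R. */
static void micro_kernel(int mr, int nr, int dc,
                         const double *A, int lda,   /* mr x dc tile of Qc */
                         const double *B, int ldb,   /* dc x nr tile of Rc */
                         double *K, int ldk)
{
    for (int i = 0; i < mr; ++i)
        for (int j = 0; j < nr; ++j) {
            double acc = 0.0;
            for (int p = 0; p < dc; ++p)
                acc += A[i * lda + p] * B[p * ldb + j];
            K[i * ldk + j] += acc;
        }
}

int main(void)
{
    double Q[D * M], R[D * N], K[M * N] = {0};  /* Q, R stored d x m and d x n    */
    for (int i = 0; i < D * M; ++i) Q[i] = (i % 7) * 0.5;
    for (int i = 0; i < D * N; ++i) R[i] = (i % 5) * 0.25;

    for (int jc = 0; jc < N; jc += NC)                 /* 6th loop: NC columns    */
      for (int pc = 0; pc < D; pc += DC) {             /* 5th loop: DC of d       */
        int dc = (D - pc < DC) ? D - pc : DC;
        for (int j = 0; j < NC; ++j)                   /* Pack Rc (reused below)  */
          for (int p = 0; p < dc; ++p)
            Rc[p * NC + j] = R[(pc + p) * N + (jc + j)];
        for (int ic = 0; ic < M; ic += MC) {           /* 4th loop: MC rows       */
          for (int i = 0; i < MC; ++i)                 /* Pack Qc (streamed)      */
            for (int p = 0; p < dc; ++p)
              Qc[i * DC + p] = Q[(pc + p) * M + (ic + i)];
          for (int jr = 0; jr < NC; jr += NR)          /* 3rd/2nd loops: tiles    */
            for (int ir = 0; ir < MC; ir += MR)
              micro_kernel(MR, NR, dc, &Qc[ir * DC], DC, &Rc[jr], NC,
                           &K[(ic + ir) * N + (jc + jr)], N);
        }
      }

    printf("K[0][0] = %g\n", K[0]);
    return 0;
}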

Page 14

Micro-Kernel

[Diagram: a 4x4 micro-tile with Q0-Q3 and R0-R3 held in registers; the first FMA fills the diagonal entries (0,0), (1,1), (2,2), (3,3).]

Instruction sequence: LOAD Q, LOAD R, FMA Q,R,C03_0, SHUFFLE, FMA Q,R,C03_1, PERMUTE2F128, FMA Q,R,C03_2, SHUFFLE, FMA Q,R,C03_3.

Page 15

Micro-Kernel

[Diagram (same instruction sequence as Page 14): after the in-lane SHUFFLE of R, the second FMA fills the entries (0,1), (1,0), (2,3), (3,2).]

Page 16

Micro-Kernel

[Diagram (same instruction sequence): after the PERMUTE2F128 of R, the third FMA fills the entries (0,3), (1,2), (2,1), (3,0).]

Page 17

Micro-Kernel

[Diagram (same instruction sequence): the final SHUFFLE and FMA fill the remaining entries (0,2), (1,3), (2,0), (3,1), completing all 16 products of the 4x4 tile.]
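A small AVX intrinsics sketch of the register pattern on Pages 14-17, for one rank-1 update of the 4x4 tile: all 16 products q_i * r_j are formed with four vector multiplies while R is rotated in registers by SHUFFLE and PERMUTE2F128, so no broadcasts are needed, and each accumulator ends up holding one "rotated diagonal" of the tile. Plain mul+add is used so the sketch needs only AVX (compile with -mavx); on FMA-capable hardware each pair fuses into the single FMA the slides list. The values are arbitrary, and the real micro-kernel repeats this inside a loop over d and later unscrambles the rotated diagonals.

#include <immintrin.h>
#include <stdio.h>

int main(void)
{
    __m256d q  = _mm256_set_pd(4.0, 3.0, 2.0, 1.0);    /* (q0,q1,q2,q3) = 1..4   */
    __m256d r  = _mm256_set_pd(40.0, 30.0, 20.0, 10.0); /* (r0,r1,r2,r3) = 10..40 */
    __m256d c0 = _mm256_setzero_pd(), c1 = c0, c2 = c0, c3 = c0;

    c0 = _mm256_add_pd(c0, _mm256_mul_pd(q, r));        /* (0,0)(1,1)(2,2)(3,3) */
    r  = _mm256_shuffle_pd(r, r, 0x5);                  /* r -> (r1,r0,r3,r2)   */
    c1 = _mm256_add_pd(c1, _mm256_mul_pd(q, r));        /* (0,1)(1,0)(2,3)(3,2) */
    r  = _mm256_permute2f128_pd(r, r, 0x1);             /* r -> (r3,r2,r1,r0)   */
    c2 = _mm256_add_pd(c2, _mm256_mul_pd(q, r));        /* (0,3)(1,2)(2,1)(3,0) */
    r  = _mm256_shuffle_pd(r, r, 0x5);                  /* r -> (r2,r3,r0,r1)   */
    c3 = _mm256_add_pd(c3, _mm256_mul_pd(q, r));        /* (0,2)(1,3)(2,0)(3,1) */

    double out[4];
    _mm256_storeu_pd(out, c0);
    printf("diagonal products: %g %g %g %g\n", out[0], out[1], out[2], out[3]);
    return 0;
}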

Page 18

Micro-Kernel with p-norm

The same register pattern supports other distances by replacing the inner-product FMA in each step:

  1-norm:   SUB, AND (clear the sign bit, i.e. absolute value), ADD
  inf-norm: SUB, AND (clear the sign bit), MAX
  p-norm:   SUB, POW (SVML), ADD
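For instance, a hedged sketch of the 1-norm update step (the function name and signature are illustrative, not GSKNN's): the absolute value is taken by ANDing away the sign bit, exactly the SUB / AND / ADD triple listed above. Compile with -mavx.

#include <immintrin.h>

static inline __m256d l1_update(__m256d acc, __m256d q, __m256d r)
{
    const __m256d sign_mask = _mm256_set1_pd(-0.0);   /* only the sign bit set  */
    __m256d diff = _mm256_sub_pd(q, r);               /* SUB                    */
    diff = _mm256_andnot_pd(sign_mask, diff);         /* AND: clear sign bit    */
    return _mm256_add_pd(acc, diff);                  /* ADD: accumulate |q-r|  */
}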

Page 19

Vectorized Math Functions

[Plot: p(x) - exp(x) on [0, ln 2].]

With high precision (20 decimal digits), the Remez exchange algorithm can generate an order-11 near-minimax polynomial with 1E-18 relative error.

[Paper excerpt, 39:16, C. D. Yu et al.]

To derive the backward stability, we move the error term into the square operation:

  (1 + ε)^(k+2) ||xi - xj||_2^2 = ||(1 + ε)^((k+2)/2) xi - (1 + ε)^((k+2)/2) xj||_2^2,

  ||(1 + ε)^((k+2)/2) xi - xi||_2 / ||xi||_2 = (1 + ε)^(k+2) - 1 ≤ O(kε).    (13)

The expansion computes the square pairwise distance more efficiently by reusing the results of ||X_A||_2^2 and ||X_B||_2^2. Similar to the direct evaluation, both schemes are backward stable, and the round-off error is also the same.

In the polynomial approximation part, the round-off error mainly comes from the nested polynomial evaluation:

  P11(x) = c11 + (... + (c5 + (c4 + (c3 + (c2 + (c1 + c0 x) x) x) x) x) x ...) x    (14)

The round-off error of the order-n (n ≥ 1) polynomial summation has the closed form c_n x^n (1 + ε)^(2n) + Σ_{i=n-1..0} c_i x^i (1 + ε)^(2i+1). This polynomial approximation is forward stable, since exp(b) ≥ 1 for b ∈ [0, ln 2]. The forward stability is derived in (15) and (16):

  |c_n x^n [(1 + ε)^(2n) - 1] + Σ_{i=n-1..0} c_i x^i [(1 + ε)^(2i+1) - 1]| / |Σ_{i=n..0} c_i x^i|    (15)

  ≤ (|2nε| |Σ_{i=n..0} c_i x^i| + O(ε^2)) / |Σ_{i=n..0} c_i x^i| = O(nε)    (16)

[Fig. 5 of the paper: Error comparison between the Remez order-11 polynomial approximation and the Intel vdExp() function. The polynomial is chosen to fit the exponential function on [0, log(2)], and the convergence criterion for the Remez exchange algorithm is the double machine epsilon (2.22E-16).]

Cost of the nested evaluation: 1 ADD + 11 FMA.
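To make the nested evaluation (14) concrete, here is a plain-C Horner sketch. The excerpt does not give the Remez-fitted coefficients, so Taylor coefficients of exp(x) are used as a stand-in (ordered as in (14): c[0] multiplies x^11, c[11] is the constant term); on a vector unit the loop body is one multiply-add per coefficient, matching the 11-FMA count above.

#include <stdio.h>
#include <math.h>

static double c[12];                           /* stand-in coefficients          */

static void init_coeffs(void)
{
    double fact = 1.0;
    for (int k = 0; k <= 11; ++k) {            /* c[11-k] = 1/k! (Taylor)        */
        if (k > 0) fact *= k;
        c[11 - k] = 1.0 / fact;
    }
}

/* Nested (Horner) evaluation of P11(x): 11 multiply-adds. */
static double p11(double x)
{
    double p = c[0];
    for (int i = 1; i <= 11; ++i)
        p = p * x + c[i];
    return p;
}

int main(void)
{
    init_coeffs();
    double x = 0.3;                            /* any point in [0, ln 2]         */
    printf("P11(%.2f) = %.17g, exp = %.17g\n", x, p11(x), exp(x));
    return 0;
}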

Page 20

Vectorized Max Heap

[Diagram: a per-query max heap H holding the current k candidate values; the max child among four packed values is found in registers.]

Instruction sequence: LOAD C, SHUFFLE D,C,0x5, MAX D,C, PERMUTE2F128 C,D,0x1, MAX D,C.

Example: starting from C = [1, 3, 4, 2], two shuffle/max steps broadcast the maximum (4) to every lane.
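An AVX sketch of the slide's max-child search (the function name is illustrative; this is not the GSKNN heap code): two shuffle/max pairs broadcast the maximum of four packed doubles to every lane. Compile with -mavx.

#include <immintrin.h>
#include <stdio.h>

static __m256d max_across(__m256d c)
{
    __m256d d;
    d = _mm256_shuffle_pd(c, c, 0x5);        /* swap neighbours within each lane */
    d = _mm256_max_pd(d, c);                 /* pairwise maxima                  */
    c = _mm256_permute2f128_pd(d, d, 0x1);   /* swap the two 128-bit lanes       */
    d = _mm256_max_pd(d, c);                 /* every lane holds the global max  */
    return d;
}

int main(void)
{
    double vals[4] = { 1.0, 3.0, 4.0, 2.0 }; /* the children of one heap node    */
    double out[4];
    _mm256_storeu_pd(out, max_across(_mm256_loadu_pd(vals)));
    printf("max child = %g\n", out[0]);      /* prints 4                         */
    return 0;
}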

Page 21

GSKS Efficiency Analysis

T_BLAS = T_GSKS + T_R + T_Q + T_K

  mn(2d + 36) / T_GSKS  -  mn(2d + 36) / T_BLAS  =  ?

Page 22

GSKNN Efficiency Graphs

[Figure 4: Predicted floating point efficiency (Gflops) vs. d, six panels: sequential Variant#1 with m = n = 8192, k = 16 and k = 512, and Variant#6 with k = 2048 (peak 28.32 Gflops), plus the corresponding 10-thread (nthd = 10) results (peak 248 Gflops). Sequential parameters: m = n = 8192; k = 16, 512, 2048; τ_fp = 8 x 3.54; τ_cm = 2.2e-9; τ_rm = 13.91e-9; ε = 0.5. For the 10-thread results, τ_fp = 10 x 8 x 3.10, and τ_cm and τ_rm are 1/5 of the original values.]

... the real experimental switching point. The predicted switching point can significantly reduce the time spent on fine-tuning the switching point.

[Figure 5: Predicted floating point efficiency (Gflops) for different k (nthd = 10, m = n = 8192, d = 16 and d = 64), showing Modeled Variant#1, Modeled Variant#6, measured Variant#1 and Variant#6, MKL+STL, and the modeled vs. measured switching thresholds.]

4. EXPERIMENTAL SETUP

In this section we give details on the experimental setup used to test our methods. Our current version of GSKNN contains double-precision x86-64 micro-kernels designed for Intel Sandy Bridge/Ivy Bridge architectures. GSKNN can be, and has been, integrated with other MPI-based parallel kNN implementations such as randomized KD-trees and locality-sensitive hashing, which typically use a BLAS approach for the local search.

Implementation and hardware: Our GSKNN routine is implemented in C and SSE2/AVX intrinsics or assembly. Other than the micro-kernel, all other parts are written in pure C. The parallel randomized KD-tree kNN is written in C++ and SSE4.2. The code is compiled with the Intel C compiler version 14.0.1.106 and mvapich2 version 2.0b with the -O3 optimization flag. We carry out runtime experiments on the Maverick system at TACC, where each node has two ten-core CPUs. The dual CPUs in each node are Intel Xeon E5-2680 v2 (Ivy Bridge) processors (2.8 GHz / 3.6 GHz) with 12.8 GB/core of memory and a three-level cache: 32 KB L1 data cache, 256 KB L2 cache and 25.6 MB L3 cache. The stable CPU clock rate is 3.54 GHz / 3.10 GHz for the 1-core / 10-core experiments.

GSKNN parameters: We choose parameters as discussed in §3.4, where mr = 8, nr = 4 and kc = 256; mc = 104 and nc = 4096, which make the size of Ac 208 KB and the size of Bc 8192 KB. For all experiments with k ≤ 512, Variant#1 is chosen; otherwise Variant#6 is used instead.

5. NUMERICAL RESULTS

We have shown and discussed our sequential design in §3.3, and here we report three sets of results: (1) breakdown analysis, (2) multi-threaded floating-point efficiency and (3) the integrated runtime of the randomized KD-tree kNN solver. All experiments are in double precision, and each result has a reference kernel implementing Algorithm 3.1 using MKL GEMM and an STL heap.

Table 6: Runtime breakdown analysis (ms), reported as MKL+STL / GSKNN, m = n = 8192. For k = 16, 128, 512 the Variant#1 GSKNN is used; for k = 2048 Variant#6 is used instead. GSKNN reports one integrated time for the first column.

d = 16
  k      Tcoll + TGEMM + Tsq2d / TGSKNN    Theap      Ttotal
  16       0 + 55 + 24 / 20                13 / 1      92 / 21
  128      0 + 55 + 24 / 20                16 / 5      95 / 25
  512      0 + 55 + 24 / 20                30 / 33    109 / 53
  2048     0 + 55 + 24 / 76                52 / 34    131 / 110

d = 64
  16       1 + 117 + 24 / 52               13 / 1     155 / 53
  128      1 + 122 + 24 / 52               15 / 6     162 / 58
  512      1 + 113 + 24 / 52               30 / 35    168 / 87
  2048     1 + 126 + 24 / 94               52 / 34    203 / 128

d = 256
  16       3 + 210 + 24 / 186              13 / 2     250 / 188
  128      3 + 209 + 24 / 186              15 / 13    251 / 199
  512      3 + 211 + 24 / 186              30 / 38    268 / 224
  2048     3 + 213 + 24 / 202              52 / 34    292 / 236

d = 1024
  16       9 + 702 + 24 / 665              13 / 0     748 / 665
  128      9 + 734 + 24 / 665              15 / 11    782 / 676
  512      9 + 728 + 24 / 665              30 / 40    791 / 705
  2048     9 + 735 + 24 / 673              51 / 34    819 / 707

We break down the total execution time Ttotal = Tcoll + TGEMM + Tsq2d + Theap, which represents the time spent collecting data from the global tables X and X², computing GEMM, evaluating the square distances, and performing the heap selection. For GSKNN, the time spent on the individual terms is difficult to collect, since a timer inside the 2nd loop would introduce serious overhead. Thus, we report the integrated time of GSKNN and estimate the heap time as the difference in total execution time against the k = 1 case. Taking the first row (k = 16) of Table 6 as an example, GSKNN spends 21 ms in total; the estimated heap selection time is 21 - 20 = 1, where 20 is the total execution time for k = 1. The breakdown results reflect the difference of (??) in memory complexity, and we are certain that the optimization leads to a smaller coefficient on the heap selection part. The performance degradation of Variant#1 for larger k can be observed in Theap, and we switch to Variant#6 for k = 2048 to preserve the GEMM efficiency.

[Slide annotation: Memory Bound.]

Page 23

Conclusion

The GEMM approach in N-body problems is a good example showing that current BLAS libraries lack flexibility for lower-level integration.

The algorithmic innovation of GSKS and GSKNN is to break through the interface, seeking the lowest memory complexity.

We exploit these observations with the help of the BLIS framework.

Ongoing work includes other operations, e.g. kernel inversion and k-means clustering, and porting to GPUs and other accelerators.

Page 24

Questions?

GSKS: github.com/ChenhanYu/ks

GSKNN: github.com/ChenhanYu/rnn