towards auto-tuning framework for numerical libraries

36
Takahiro Katagiri Information Technology Center, The University of Tokyo 1 First French-Japanese Workshop - Petascale Application, Algorithms and Programming (PAAP) - December 1st, 2007, 2:10pm – 2:40pm

Upload: eliot

Post on 13-Jan-2016

40 views

Category:

Documents


2 download

DESCRIPTION

Towards Auto-tuning Framework for Numerical Libraries. Takahiro Katagiri Information Technology Center, The University of Tokyo. First French-Japanese Workshop - Petascale Application, Algorithms and Programming (PAAP) - December 1st, 2007, 2:10pm – 2:40pm. Outlines. Motivation - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: Towards  Auto-tuning Framework  for Numerical Libraries

Takahiro KatagiriInformation Technology Center,

The University of Tokyo

1

First French-Japanese Workshop - Petascale Application, Algorithms and Programming (PAAP) -December 1st, 2007, 2:10pm – 2:40pm

Page 2: Towards  Auto-tuning Framework  for Numerical Libraries

Motivation Our Solutions

FIBER : An Auto-tuning FrameworkABCLibScript: An Auto-tuning

Description LanguageABCLib: A Library with Auto-tuning

Facility ABCLib_DRSSED: An Eigenvalue Solver MS-MPI Run-time Auto-tuning Project

Related Projects Conclusion Remarks

2

Page 3: Towards  Auto-tuning Framework  for Numerical Libraries

To establish high productivity on numerical software

3

Page 4: Towards  Auto-tuning Framework  for Numerical Libraries

4

Why so high cost?1. Explosion of search space for tuning parameters

Excessive development processes 2. Tuning is not science, but craftspeople work…

Excessive personnel costs

1. Excessive development processesMany algorithm parameters

Preconditioner, restart frequency, block algorithm length, …

Complex current computer architectures multicore, unsymmetrical memory access,…

2. Excessive personnel costs Intricate high performance

implementations Craftspeople only can do it.

Compilers do not work well on the complex current computers….

Page 5: Towards  Auto-tuning Framework  for Numerical Libraries

5

Averaged gap: 10x. Dedicated sizes: 100x.How should we manage it?

Tim

e in

S

eco

nd

s

No Unrolling(Compiler optimization)

•Unrolled coeds for matrix-matrix multiplication with nested 3 loops (i,j,k) from 1 to 4.

•The variation is 4*4*4=64 kinds. •For matrix size N, it varies from 1 to 2048 stridden 1.

•Compiler: HITACHI Optimized Fortran90. Option: -Oss with automatically parallelization.•Machine: HITACHI SR11000/J2 Model installed in Information Technology Center, The University of Tokyo. It has 16PEs per node.

Page 6: Towards  Auto-tuning Framework  for Numerical Libraries

1. To reduce tuning processes: Automation of tuning can reduce the Automation of tuning can reduce the

tuning process to hand-tuning.tuning process to hand-tuning. Tuning is time-consuming work even in craftsman.

Writing complicated codes. Troublesome test-run to tune

2. To reduce personnel cost: ““Automatic Tuning Recipe” makes Automatic Tuning Recipe” makes

tuning tuning non-expert work.non-expert work.

Software Framework Auto-tuning facility Computer language for non-expert developers Source code generator

Tuning object codes and tuning control codes6

Page 7: Towards  Auto-tuning Framework  for Numerical Libraries

FIBER, ABCLib, ABCLibScript

7

Page 8: Towards  Auto-tuning Framework  for Numerical Libraries

8

…Operating SystemsOperating Systems

Communication Communication Libraries (MPI)Libraries (MPI)

Linear

EquationsSolvers

Eigenvalue

Solvers

Library InterfaceLibrary Interface

HITACHI SR Fujitsu VPP NEC SXPC Clusters

Auto-tuningAuto-tuningFacilityFacility   

Auto-modeling Funct.Auto-modeling Funct.

Code generation Funct.Code generation Funct.

Parameter Opt. Funct.Parameter Opt. Funct.

BLASPerformanceParameters

OptimizationCodes & Info.

ImplementationInfo.

Scheduling & Computer Info.

CompilersCompilers

SparseDirect

Solvers

Page 9: Towards  Auto-tuning Framework  for Numerical Libraries

FIBER:

11

6

Operating Systems

Communication Libraries (MPI)

LinearEquationsSolvers

Eigenvalue Solvers

Library Interface

HITACHI SR Fujitsu VPP NEC SXPC Clusters

Auto-tuningFacility

Auto-modeling Funct.

Code generation Funct.

Parameter Opt. Funct.

BLAS PerformanceParameters

OptimizationCodes & Info.

ImplementationInfo.

Scheduling & Computer Info.

Compilers

SparseDirectSolvers

Page 10: Towards  Auto-tuning Framework  for Numerical Libraries

FIBER (Framework for Install-time, Before Execute-time and Run-time auto-tuning) Paradigm FIBER paradigm is a methodology for auto-tuning software

to generalize application and obtain high accuracy for estimated parameters.

How Auto-tuning is performed:(a) Parameters that affect performance are

extracted(b) The parameters are automatically optimized

(a) Parameter extraction:by users utilizing a dedicated language

(ABCLibScript ABCLibScript ) (b) Parameter optimization:

three kinds of optimization layersusing statistical methods

12

Page 11: Towards  Auto-tuning Framework  for Numerical Libraries

15

Library Developers

Specified by library developersIncludes instructions for optimizationIndependence of computer environments

Develop the codes using ABCLibScript

Source codesincluding auto-tuning facilities

Loop unrolled codeAlgorithm (sub-routine) selection codeParameter optimization functionParameter search function

Release library to the public

Execute pre-processor( ABCLibCodeGen )

Page 12: Towards  Auto-tuning Framework  for Numerical Libraries

Install-time Optimization

16

End-users

Estimated best unrolling depthEstimated best block lengthGenerated library object

Specified tuned parameters

Install the released library into user’s machine environment

(FIBER install-time optimization is performed )

Debugging and Application Developments Debugging and Application Developments Using Small Sized ProblemsUsing Small Sized Problems

Finish debugging or developing

Use semi-optimizedlibrary

Page 13: Towards  Auto-tuning Framework  for Numerical Libraries

Before Execute-time optimization

17

Perform BeforeExecute-timeoptimization

Specify parameters with end-user’s knowledge (e.g., problem sizes to execute)

Specified best parameters using user’s knowledge

Run-time optimizationRun-time optimization

Library is running

Library executioncall

CalcEigen(A,x,lamba,n)

Large-Scale ComputationLarge-Scale Computation

Use fully optimized library

Specify best parameters using the run-time parameter information

Page 14: Towards  Auto-tuning Framework  for Numerical Libraries

ABCLibScript:

18

6

Operating Systems

Communication Libraries (MPI)

LinearEquationsSolvers

Eigenvalue Solvers

Library Interface

HITACHI SR Fujitsu VPP NEC SXPC Clusters

Auto-tuningFacility

Auto-modeling Funct.

Code generation Funct.

Parameter Opt. Funct.

BLAS PerformanceParameters

OptimizationCodes & Info.

ImplementationInfo.

Scheduling & Computer Info.

Compilers

SparseDirectSolvers

Page 15: Towards  Auto-tuning Framework  for Numerical Libraries

!ABCLib$ install unroll (i) region start!ABCLib$ name MyMatMul!ABCLib$ varied (i) from 1 to 8!ABCLib$ debug (pp) do i=1, N do j=1, N da1 = A(i, j) do k=1, N dc = C(k, j) da1 = da1 + B(i, k) * dc enddo A(i, j) = da1 enddo enddo!ABCLib$ install unroll (i) region end

Unrolling Depth : Developer specifies using directive Ex. : Matrix-matrix multiplication code

19

Install-time optimization;

Unrolling process;

Unrolling Depth

Target Region(Auto-tuning Region)

Page 16: Towards  Auto-tuning Framework  for Numerical Libraries

if (i_unroll .eq. 1) then Original Code endifif (i_unroll .eq. 2) then /* i is dividable by 2 */ im = N/2 i = 1 do ii=1, im do j=1, N da1 = A(i, j); da2 = A(i+1,j) do k=1, N dc = C(k, j) da1 = da1 + B(i, k) * dc; da2 = da2 + B(i+1, k) * dc; enddo A(i, j) = da1; A(i+1,j) = da2 enddo i = i + 2; enddo

endif

After invocating pre-processor, the outer i loop is unrolled.

20

After code generation, After code generation, the depth of unrolling is the depth of unrolling is

automatically parameterized.automatically parameterized.

Page 17: Towards  Auto-tuning Framework  for Numerical Libraries

!ABCLib$ static select region start!ABCLib$ parameter (in CacheS, in NB, in NPrc) !ABCLib$ select sub region start!ABCLib$ according estimated!ABCLib$ (2.0d0*CacheS*NB)/(3.0d0*NPrc)

     Target 1( Algorithm1 ) !ABC-LIB$ select sub region end!ABC-Lib$ select sub region start!ABC-Lib$ according estimated!ABC-Lib$ (4.0d0*ChcheS*dlog(NB))/(2.0d0*NPrc) Target2 ( Algorithm2 )!ABC-LIB$ select sub region end!ABC-LIB$ static select region end

Selecting algorithms as follows:

21

Install-time Optimization; Selection Operation;  

Input Variables Used in Cost Definition Funct.Cost Definition Funct.

Selection Base on The Cost

Definition Function

Target Region1( Tuning Region1 )

Target Region2( Tuning Region2 )

Selection information for Selection information for Target 1 and 2 is parameterizedTarget 1 and 2 is parameterized.

Page 18: Towards  Auto-tuning Framework  for Numerical Libraries

Frank Matrix: Execution Time

Tim

e[se

c.]

#Proc.

Accu

racy

[Frob

en

ius

]

#Proc.

CG-S(1)CG-S(2)MG-SHG-SIR-CGSNoOrt.

Frank Matrix: Orithogonality

Required AccuracyFrom End-user

From 7x to 20x Speedups MG-S: Default

(with respect to numerical stability)

Page 19: Towards  Auto-tuning Framework  for Numerical Libraries

Target Application Matrix-Matrix Multiplication

ABCLibScript DirectiveUnroll operator only

Computer Environment Intel Pentium4 (2.0GHz), PGI compiler

SubjectsSubject A : Non-expertSubject B : Semi-expert (He knows block

algorithm.) Experiment term

2 weeks for hand tuning2 hours for ABCLibScript programming

23

Page 20: Towards  Auto-tuning Framework  for Numerical Libraries

24

4x Speedup

HIG

H

Subject A

Page 21: Towards  Auto-tuning Framework  for Numerical Libraries

25

Maximum2.5x speedup

HIG

H

Subject B

Page 22: Towards  Auto-tuning Framework  for Numerical Libraries

The performance was increased on between non-expert and semi-expert developers.

The development term was reduced from 2 weeks to 2 from 2 weeks to 2 hours hours with keeping better performance.

26

Page 23: Towards  Auto-tuning Framework  for Numerical Libraries

ABCLib:

27

Page 24: Towards  Auto-tuning Framework  for Numerical Libraries

6

Operating Systems

Communication Libraries (MPI)

LinearEquationsSolvers

Eigenvalue Solvers

Library Interface

HITACHI SR Fujitsu VPP NEC SXPC Clusters

Auto-tuningFacility

Auto-modeling Funct.

Code generation Funct.

Parameter Opt. Funct.

BLAS PerformanceParameters

OptimizationCodes & Info.

ImplementationInfo.

Scheduling & Computer Info.

Compilers

SparseDirectSolvers

ABCLib_DRESSED:

28

Page 25: Towards  Auto-tuning Framework  for Numerical Libraries

29

Automatically Blocking-and-Communication adjustment LIBrary

Timing for auto-tuning: Install-time Kernels for auto-tuning: about 30,000 lines.

1.1. Eigensolver (Real, Symmetric, Dense Eigensolver (Real, Symmetric, Dense matrix)matrix)

Householder Tridiagonalization (Tri)Householder Tridiagonalization (Tri)1. BLAS2 Unrolling Depth: Matrix-vector product ; 8 kinds;2. BLAS2 Unrolling Depth: Matrix updating process; 8 kinds;3. Communication Implementations: (One-to-one, Collective)

Householder Inverse Transformation (Inv)Householder Inverse Transformation (Inv) BLAS2 Unrolling Depth: Matrix updating process; 8 kinds; Communication Implementations:

(Blocking one-to-one, Non-blocking one-to-one, Collective)

2.2. QR Decomposition (Gram-Schmidt) QR Decomposition (Gram-Schmidt) 1. BLAS3 Unrolling Depths:

Matrix updating process; 4(outer) * 8(second) = 32 kinds * 2 parts;2. Block Length for Algorithm: From 1 to 8; 3. Communication Frequency (According to the block length)

Page 26: Towards  Auto-tuning Framework  for Numerical Libraries

30

Problem Size :6,123 ( SR/Sugg.)1,234 ( SR/no )

5,123 ( VPP/Sugg.)912 ( VPP/no )

5,123 ( PC/Sugg.)2,345 ( PC/no )

1.1—2.6 times :to default1.1 times : to Install-time

Execu

tion

time in

S

eco

nd

Page 27: Towards  Auto-tuning Framework  for Numerical Libraries

31

Problem Size :5,123 ( SR/Sugg.)2,345 ( SR/no )

6,123 ( VPP/Sugg.)912 ( VPP/no )

5,123 ( PC/Sugg.)2,345 ( PC/no )

1.2—3.5 times:to default

1.2—1.9 times:to Install-timeMax.3.4 times:to estimationfailed case

Execu

tion

time in

S

eco

nd

s

Page 28: Towards  Auto-tuning Framework  for Numerical Libraries

6

Operating Systems

Communication Libraries (MPI)

LinearEquationsSolvers

Eigenvalue Solvers

Library Interface

HITACHI SR Fujitsu VPP NEC SXPC Clusters

Auto-tuningFacility

Auto-modeling Funct.

Code generation Funct.

Parameter Opt. Funct.

BLAS PerformanceParameters

OptimizationCodes & Info.

ImplementationInfo.

Scheduling & Computer Info.

Compilers

SparseDirectSolvers

MS-MPI Auto-tuning project:

33

Page 29: Towards  Auto-tuning Framework  for Numerical Libraries

1. PC crusted with the Windows CCS 20032. Using MPI

Windows CCS 2003 provides MS-MPI

Assumption:

Nodes to be allocated are determined by scheduling policy on the Windows CCS 2003. The physical topology for the allocated node

affects communication performance. Communication pattern depends on

the distribution of zero elements for input matrices.

-> It is impossible to find the best communication implementation before the running!

Problem:

Page 30: Towards  Auto-tuning Framework  for Numerical Libraries

Logging for past calls is performed at run-time.

Main target: Sparse iterative solver. Same MPI function is called many times.

Communication implementation selection is performed at run-time.

1.1. Ring sending Ring sending vs.vs. Binary three Binary three sendingsending

2.2. SynchronousSynchronous vs. vs. Asynchronous Asynchronous 3.3. Overlapping Overlapping vs.vs. Non-overlapping Non-overlapping4.4. Recursive halvingRecursive halving vs. vs. NormalNormal

Final goal: Implementing a MPI lapper No modification of codes for end-user.No modification of codes for end-user.

Page 31: Towards  Auto-tuning Framework  for Numerical Libraries

•Target Application•Parallel Sparse Iterative solver (GMRES Method)

•Developed by Dr. H.Kuroda (U. of Tokyo)•Following performance parameters are auto-tuned according to input matrix:

1.1.Selection of preconditioner Selection of preconditioner (Scaling, (Scaling, Jacobi, …)Jacobi, …)2.2.Adjustment of loop unrolling depth Adjustment of loop unrolling depth for sparse matrix multiplicationfor sparse matrix multiplication3.3.Selection of MPISelection of MPI implementations implementations (Gather, Overlap, Collective matter, …)(Gather, Overlap, Collective matter, …)

•Experimental environment•Microsoft Innovation Center (MIC) at Chou-fu•AMD Athelon 64 X2 Dual, Cell Processor 3800+ (2.01GHz,2GByte RAM)•Windows CCS, MS-MPI, Visual Studio2005 C++

Page 32: Towards  Auto-tuning Framework  for Numerical Libraries

The Toeplitz Matrix 5 Points Deference Matrix

Maximum 20x speedupMaximum 20x speedup

Page 33: Towards  Auto-tuning Framework  for Numerical Libraries

SaNS (Self-adapting Numerical Software) Project@ University of Tennessee at Knoxville SaNS Agent :

Provide intelligent components for the behavior of data, algorithms, and systems

Adapt computational Grid Provide data repository for performance data

Provide a simple scripting language BeBOP (Berkeley Benchmarking and

Optimization Group) Project @ University of California at Berkeley OSKI : Optimized Sparse Kernel InterfaceOSKI : Optimized Sparse Kernel Interface

A collection of low-level primitives that provide automatically tuned computational kernels on sparse matrices, for use by solver libraries and applications.

SPIRAL Project @ Carnegie Mellon University Software/Hardware Generation for DSP algorithmSoftware/Hardware Generation for DSP algorithm

38

Page 34: Towards  Auto-tuning Framework  for Numerical Libraries

To establish high productivity on numerical libraries, auto-tuning facility is needed.FIBERFIBER is one of the promising frameworks for

establishing high productivity.ABCLibScript ABCLibScript is the computer language to

describe auto-tuning process based on FIBER for general applications.

Next generation supercomputers must have..complicated architectures (multicore,…)more than 10,000 processors-> we need somehow intelligent and automated tuning systems.

39

Page 35: Towards  Auto-tuning Framework  for Numerical Libraries

Auto-Tuning Research Group in JAPANChair: Toshitsugu Yuba (U. of Electro-comm.)Vice Chair: Takahiro Katagiri (U. of Tokyo)Reiji Suda (U. of Tokyo) Toshiyuki Imamura (U. of Electro-comm.)Yusaku Yamamoto (Nagoya U.)Ken Naono (HITACHI Ltd.)Kentaro Shimizu (U. of Tokyo)Hiroyuki Sato (U. of Tokyo)Shoji Ito (RIKEN)Takeshi Iwashita (Kyoto U.)Kazuya Terauchi (Japan Visual Numerics Inc.)Masashi Egi (HITACHI Ltd.)Takao Sakurai (HITACHI Ltd.)Hisayasu Kuroda (U. of Tokyo)

40

Page 36: Towards  Auto-tuning Framework  for Numerical Libraries

If you are interested in ABCLib project, please visit: http://www.abc-lib.org/

41